A significant number of hotel bookings end in cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, which has a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression_model using statsmodels
import statsmodels.api as sm
# To tune different models
from sklearn.model_selection import GridSearchCV
# To perform statistical analysis
import scipy.stats as stats
from IPython.display import Image
from os import system
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
plot_confusion_matrix,
make_scorer,
)
# Read the data file; df keeps the raw data and data is the working copy. head() and tail() below show the first and last 5 rows.
df = pd.read_csv("INNHotelsGroup.csv")
data = df.copy()
data.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
data.tail()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67 | 0 | Not_Canceled |
data.shape
(36275, 19)
data[data.duplicated()].count()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
data.drop_duplicates(inplace=True)
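A more direct way to count fully duplicated rows is to sum the boolean mask that `duplicated()` returns. The sketch below runs on a small hypothetical frame (`toy`), not on the booking data, and is only meant to illustrate the two calls.

```python
import pandas as pd

# Hypothetical three-row frame; the last row repeats the first.
toy = pd.DataFrame(
    {
        "lead_time": [224, 5, 224],
        "booking_status": ["Not_Canceled", "Not_Canceled", "Not_Canceled"],
    }
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame and keeps the first occurrence.
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```

On the booking data the equivalent check would be `data.duplicated().sum()`, which gives a single number instead of one count per column.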
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.5+ MB
# Booking_ID is unique for every row (checked below), so it carries no predictive signal and is dropped
data["Booking_ID"].nunique()
36275
data.drop(["Booking_ID"], axis=1, inplace=True)
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
# Printing all the column names
print(data.columns)
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',
'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month',
'arrival_date', 'market_segment_type', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'avg_price_per_room', 'no_of_special_requests', 'booking_status'],
dtype='object')
cat_columns = ["type_of_meal_plan", "room_type_reserved", "market_segment_type"]
for i in cat_columns:
    print(data[i].value_counts())
    print("*" * 50)
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
**************************************************
Room_Type 1    28130
Room_Type 4     6057
Room_Type 6      966
Room_Type 2      692
Room_Type 5      265
Room_Type 7      158
Room_Type 3        7
Name: room_type_reserved, dtype: int64
**************************************************
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: market_segment_type, dtype: int64
**************************************************
Questions:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count or percentage at the top of each bar
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
labeled_barplot(data, "arrival_month", perc=True)
labeled_barplot(data, "market_segment_type", perc=True)
data["total_stay"] = data["no_of_week_nights"] + data["no_of_weekend_nights"]
plt.figure(figsize=(12, 6))
sns.barplot(x="market_segment_type", y="total_stay", data=data)
plt.title("Total nights spent by guest by market segment", weight="bold")
plt.xlabel("market_segment_type")
plt.ylabel("Number of days")
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    # place the legend outside the axes, to the right of the plot
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# avg_price_per_room takes many distinct values, so bin it into three cost categories
data["room_cost_cat"] = pd.cut(
    x=data.avg_price_per_room,
    bins=[-np.inf, 80, 120, np.inf],
    labels=["Standard", "Moderate", "Premium"],
)
data["room_cost_cat"].value_counts()
# pd.cut closes each interval on the right by default, so:
# Standard = price <= 80
# Moderate = 80 < price <= 120
# Premium  = price > 120
Moderate    18280
Premium      9058
Standard     8937
Name: room_cost_cat, dtype: int64
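Which side of each bin is closed decides where a price of exactly 80 or 120 ends up. The sketch below uses a handful of hypothetical prices to compare the default `right=True` with `right=False`; the labels and edges are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices sitting on and around the bin edges.
prices = pd.Series([79.99, 80.0, 80.01, 120.0, 120.01])

labels = ["Standard", "Moderate", "Premium"]
edges = [-np.inf, 80, 120, np.inf]

# Default right=True: intervals are (-inf, 80], (80, 120], (120, inf).
right_closed = pd.cut(prices, bins=edges, labels=labels)

# right=False: intervals are [-inf, 80), [80, 120), [120, inf).
left_closed = pd.cut(prices, bins=edges, labels=labels, right=False)

print(
    pd.DataFrame(
        {"price": prices, "right=True": right_closed, "right=False": left_closed}
    )
)
```

If the intended rule is "Premium starts at 120", `right=False` is the setting that matches it.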
# room_cost_cat vs market_segment_type
stacked_barplot(data, "room_cost_cat", "market_segment_type")
market_segment_type  Aviation  Complementary  Corporate  Offline  Online    All
room_cost_cat
All                       125            391       2017    10528   23214  36275
Moderate                  121              0        701     5957   11501  18280
Standard                    4            389       1166     3826    3552   8937
Premium                     0              2        150      745    8161   9058
------------------------------------------------------------------------------------------------------------------------
labeled_barplot(data, "booking_status", perc=True)
plt.figure(figsize=(12, 6))
sns.barplot(x="arrival_year", y="lead_time", hue="booking_status", data=data)
plt.title("Arrival year with lead time", weight="bold")
plt.xlabel("Arrival year")
plt.ylabel("Lead time")
labeled_barplot(data, "repeated_guest", perc=True)
# market_segment_type vs repeated_guest
stacked_barplot(data, "market_segment_type", "repeated_guest")
repeated_guest           0    1    All
market_segment_type
All                  35345  930  36275
Corporate             1415  602   2017
Complementary          265  126    391
Online               23118   96  23214
Offline              10438   90  10528
Aviation               109   16    125
------------------------------------------------------------------------------------------------------------------------
# room_cost_cat vs repeated_guest
stacked_barplot(data, "room_cost_cat", "repeated_guest")
repeated_guest      0    1    All
room_cost_cat
All             35345  930  36275
Standard         8227  710   8937
Moderate        18082  198  18280
Premium          9036   22   9058
------------------------------------------------------------------------------------------------------------------------
# required_car_parking_space vs repeated_guest
stacked_barplot(data, "required_car_parking_space", "repeated_guest")
repeated_guest                  0    1    All
required_car_parking_space
All                         35345  930  36275
0                           34360  791  35151
1                             985  139   1124
------------------------------------------------------------------------------------------------------------------------
# no_of_special_requests vs booking_status
stacked_barplot(data, "no_of_special_requests", "booking_status")
booking_status          Canceled  Not_Canceled    All
no_of_special_requests
All                        11885         24390  36275
0                           8545         11232  19777
1                           2703          8670  11373
2                            637          3727   4364
3                              0           675    675
4                              0            78     78
5                              0             8      8
------------------------------------------------------------------------------------------------------------------------
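The crosstab above prints counts only; the stacked chart shows the shares but not as numbers. As a small sketch, the cancellation rate per number of special requests can be computed from those counts. The frame `counts` below is typed in by hand from the printed output.

```python
import pandas as pd

# Counts copied from the crosstab printed above (the "All" row is left out).
counts = pd.DataFrame(
    {
        "Canceled": [8545, 2703, 637, 0, 0, 0],
        "Not_Canceled": [11232, 8670, 3727, 675, 78, 8],
    },
    index=pd.Index([0, 1, 2, 3, 4, 5], name="no_of_special_requests"),
)

# Share of bookings in each row that ended in a cancellation.
cancel_rate = counts["Canceled"] / counts.sum(axis=1)

print(cancel_rate.round(3))
```

On the full data the same figures should come from `pd.crosstab(data["no_of_special_requests"], data["booking_status"], normalize="index")`, which is the table `stacked_barplot` plots.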
# required_car_parking_space vs booking_status
stacked_barplot(data, "required_car_parking_space", "booking_status")
booking_status              Canceled  Not_Canceled    All
required_car_parking_space
All                            11885         24390  36275
0                              11771         23380  35151
1                                114          1010   1124
------------------------------------------------------------------------------------------------------------------------
# type_of_meal_plan vs booking_status
stacked_barplot(data, "type_of_meal_plan", "booking_status")
booking_status     Canceled  Not_Canceled    All
type_of_meal_plan
All                   11885         24390  36275
Meal Plan 1            8679         19156  27835
Not Selected           1699          3431   5130
Meal Plan 2            1506          1799   3305
Meal Plan 3               1             4      5
------------------------------------------------------------------------------------------------------------------------
# room_type_reserved vs booking_status
stacked_barplot(data, "room_type_reserved", "booking_status")
booking_status      Canceled  Not_Canceled    All
room_type_reserved
All                    11885         24390  36275
Room_Type 1             9072         19058  28130
Room_Type 4             2069          3988   6057
Room_Type 6              406           560    966
Room_Type 2              228           464    692
Room_Type 5               72           193    265
Room_Type 7               36           122    158
Room_Type 3                2             5      7
------------------------------------------------------------------------------------------------------------------------
# Observation on no_of_adults
labeled_barplot(data, "no_of_adults", perc=True)
# Observation on no_of_children
labeled_barplot(data, "no_of_children", perc=True)
# Observation on no_of_weekend_nights
labeled_barplot(data, "no_of_weekend_nights", perc=True)
# Observation on no_of_week_nights
labeled_barplot(data, "no_of_week_nights", perc=True)
# Observation on arrival_year
labeled_barplot(data, "arrival_year", perc=True)
# Observation on arrival_date
labeled_barplot(data, "arrival_date", perc=True)
# Observation on no_of_previous_cancellations
labeled_barplot(data, "no_of_previous_cancellations", perc=True)
# Observation on no_of_special_requests
labeled_barplot(data, "no_of_special_requests", perc=True)
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; showmeans=True adds a marker (a triangle by default) at the mean
    # histogram, with an explicit number of bins only when one is given
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# Observation on avg_price_per_room
histogram_boxplot(data, "avg_price_per_room", bins=100)
plt.figure(figsize=(20, 10))
bn = sns.boxplot(x="booking_status", y="lead_time", data=data, palette="PuBu")
plt.xticks(rotation=60)
plt.show()
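# lead_time takes many distinct values, so bin it at its quartiles (17, 57, 126)
data["lead_time_bins"] = pd.cut(
    x=data.lead_time,
    bins=[-np.inf, 17, 57, 126, np.inf],
    labels=["Short", "Moderate", "High", "Extreme"],
)
data["lead_time_bins"].value_counts()
# pd.cut closes each interval on the right by default, so:
# Short    = lead_time <= 17        (up to the 25th percentile)
# Moderate = 17 < lead_time <= 57   (25th to 50th percentile)
# High     = 57 < lead_time <= 126  (50th to 75th percentile)
# Extreme  = lead_time > 126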
Short       9226
Extreme     9058
High        9015
Moderate    8976
Name: lead_time_bins, dtype: int64
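The cut points 17, 57 and 126 are the quartiles of `lead_time` reported by `describe()`. As an alternative sketch, `pd.qcut` derives quantile edges from the data itself instead of hard-coding them. The series below is synthetic and only stands in for the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for lead_time: the integers 0..999 in random order.
lead_time = pd.Series(rng.permutation(1000), name="lead_time")

# q=4 splits at the 25th, 50th and 75th percentiles computed from the data.
bins = pd.qcut(lead_time, q=4, labels=["Short", "Moderate", "High", "Extreme"])

print(bins.value_counts().sort_index())
```

With tied values, as in the real column, the four groups are only approximately equal in size, which is consistent with the counts shown above.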
plt.figure(figsize=(15, 7))
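# restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns
sns.heatmap(
    data.select_dtypes(include=np.number).corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)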
plt.show()
sns.pairplot(data, diag_kind="kde")
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
distribution_plot_wrt_target(data, "no_of_weekend_nights", "booking_status")
distribution_plot_wrt_target(data, "no_of_week_nights", "booking_status")
distribution_plot_wrt_target(data, "lead_time", "booking_status")
distribution_plot_wrt_target(data, "arrival_month", "booking_status")
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
distribution_plot_wrt_target(data, "no_of_special_requests", "booking_status")
data.isnull().sum().sort_values(ascending=False)
no_of_adults                            0
market_segment_type                     0
room_cost_cat                           0
total_stay                              0
booking_status                          0
no_of_special_requests                  0
avg_price_per_room                      0
no_of_previous_bookings_not_canceled    0
no_of_previous_cancellations            0
repeated_guest                          0
arrival_date                            0
no_of_children                          0
arrival_month                           0
arrival_year                            0
lead_time                               0
room_type_reserved                      0
required_car_parking_space              0
type_of_meal_plan                       0
no_of_week_nights                       0
no_of_weekend_nights                    0
lead_time_bins                          0
dtype: int64
df1 = data.copy()
cat_feature = [feature for feature in df1.columns if df1[feature].dtype == "object"]
print("Number of Categorical Features are :", len(cat_feature))
Number of Categorical Features are : 4
df1[cat_feature][:5]
| | type_of_meal_plan | room_type_reserved | market_segment_type | booking_status |
|---|---|---|---|---|
| 0 | Meal Plan 1 | Room_Type 1 | Offline | Not_Canceled |
| 1 | Not Selected | Room_Type 1 | Online | Not_Canceled |
| 2 | Meal Plan 1 | Room_Type 1 | Online | Canceled |
| 3 | Meal Plan 1 | Room_Type 1 | Online | Canceled |
| 4 | Not Selected | Room_Type 1 | Online | Canceled |
for feature in cat_feature:
    print("{} : {}".format(feature, len(df1[feature].unique())))
type_of_meal_plan : 4
room_type_reserved : 7
market_segment_type : 5
booking_status : 2
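# No "Undefined" level appears in the meal-plan value counts above, so this replacement leaves the column unchanged
df1["type_of_meal_plan"] = df1["type_of_meal_plan"].replace("Undefined", "SC")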
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df1['type_of_meal_plan'] = le.fit_transform(df1['type_of_meal_plan'])
df1['room_type_reserved'] = le.fit_transform(df1['room_type_reserved'])
df1['market_segment_type'] = le.fit_transform(df1['market_segment_type'])
df1.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | total_stay | room_cost_cat | lead_time_bins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 0 | 0 | 224 | 2017 | 10 | 2 | 3 | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled | 3 | Standard | Extreme |
| 1 | 2 | 0 | 2 | 3 | 3 | 0 | 0 | 5 | 2018 | 11 | 6 | 4 | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled | 5 | Moderate | Short |
| 2 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 2018 | 2 | 28 | 4 | 0 | 0 | 0 | 60.00 | 0 | Canceled | 3 | Standard | Short |
| 3 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 211 | 2018 | 5 | 20 | 4 | 0 | 0 | 0 | 100.00 | 0 | Canceled | 2 | Moderate | Extreme |
| 4 | 2 | 0 | 1 | 1 | 3 | 0 | 0 | 48 | 2018 | 4 | 11 | 4 | 0 | 0 | 0 | 94.50 | 0 | Canceled | 2 | Moderate | Moderate |
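`LabelEncoder` assigns integer codes in sorted label order, which is why "Not Selected" shows up as 3 in the table above. The sketch below recovers that mapping for the meal-plan levels; `meal_plan`, `encoder` and `mapping` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Meal-plan levels as they appear in the value counts earlier in the notebook.
meal_plan = pd.Series(["Meal Plan 1", "Not Selected", "Meal Plan 2", "Meal Plan 3"])

encoder = LabelEncoder()
codes = encoder.fit_transform(meal_plan)

# classes_ is sorted; the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because one `le` object is refitted for each column in the cell above, its `classes_` only reflects the last column fitted; keeping one encoder per column would be needed to invert the codes later. The integer codes also impose an order on categories that have none, which is worth keeping in mind when reading coefficients of a linear model.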
# let's plot the boxplots of all numerical columns to check for outliers
plt.figure(figsize=(20, 30))
for i, var in enumerate(data.select_dtypes(include=np.number).columns.tolist()):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[var], whis=1.5)
    plt.tight_layout()
    plt.title(var)
plt.show()
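The boxplots flag points that lie more than 1.5 × IQR beyond the quartiles. A numeric count can complement the visual check; the helper below, `iqr_outlier_share`, is a hypothetical function written for this illustration and is shown on a toy frame.

```python
import pandas as pd


def iqr_outlier_share(frame: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Fraction of values per numeric column outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns.
    outside = (numeric < q1 - whis * iqr) | (numeric > q3 + whis * iqr)
    return outside.mean()


# Toy column: ten ordinary values and one extreme value.
toy = pd.DataFrame({"lead_time": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]})
share = iqr_outlier_share(toy)
print(share)
```

Calling `iqr_outlier_share(data)` on the booking data would give the share of flagged rows per column, which makes the cost of any outlier treatment explicit.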
There are a lot of outliers; treating them would lead to loss of data.

## EDA
It is a good idea to explore the data once again after manipulating it.
for col in ["avg_price_per_room", "lead_time"]:
    histogram_boxplot(data, col)
# Looking at the correlations
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
# correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
df1[numeric_columns].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
)
# plt.show()
plt.savefig("heat_map", dpi=300, bbox_inches="tight")
# Encoding our dependent variable booking_status = Not_Canceled as 0 and Canceled as 1
df1["booking_status"] = df1["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
# defining X and y variables
X = df1.drop(["booking_status"], axis=1)
y = df1["booking_status"]
# view independent variables
X.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | total_stay | room_cost_cat | lead_time_bins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 0 | 0 | 224 | 2017 | 10 | 2 | 3 | 0 | 0 | 0 | 65.00 | 0 | 3 | Standard | Extreme |
| 1 | 2 | 0 | 2 | 3 | 3 | 0 | 0 | 5 | 2018 | 11 | 6 | 4 | 0 | 0 | 0 | 106.68 | 1 | 5 | Moderate | Short |
| 2 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 2018 | 2 | 28 | 4 | 0 | 0 | 0 | 60.00 | 0 | 3 | Standard | Short |
| 3 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 211 | 2018 | 5 | 20 | 4 | 0 | 0 | 0 | 100.00 | 0 | 2 | Moderate | Extreme |
| 4 | 2 | 0 | 1 | 1 | 3 | 0 | 0 | 48 | 2018 | 4 | 11 | 4 | 0 | 0 | 0 | 94.50 | 0 | 2 | Moderate | Moderate |
# view dependent variables
y.head()
0    0
1    0
2    1
3    1
4    1
Name: booking_status, dtype: int64
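The lambda used for the target maps every label other than "Canceled" to 0, including any misspelled or unexpected value. As an alternative sketch, an explicit dictionary with `map()` makes such values visible as missing. The series `status` below holds the first five labels shown by `data.head()`.

```python
import pandas as pd

# First five booking_status labels from data.head().
status = pd.Series(
    ["Not_Canceled", "Not_Canceled", "Canceled", "Canceled", "Canceled"],
    name="booking_status",
)

# map() leaves NaN for any label missing from the dictionary,
# so unexpected values do not silently become 0.
encoded = status.map({"Not_Canceled": 0, "Canceled": 1})

# Fail early if a label was not covered by the dictionary.
assert encoded.notna().all(), "unexpected booking_status label"
print(encoded.tolist())
```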
We want to predict the booking status, so booking_status is the dependent variable.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
We will build a Linear Regression model using the train data and then check its performance.
# encoding the categorical variables
X = pd.get_dummies(X, drop_first=True)
X.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | total_stay | room_cost_cat_Moderate | room_cost_cat_Premium | lead_time_bins_Moderate | lead_time_bins_High | lead_time_bins_Extreme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 0 | 0 | 224 | 2017 | 10 | 2 | 3 | 0 | 0 | 0 | 65.00 | 0 | 3 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 0 | 2 | 3 | 3 | 0 | 0 | 5 | 2018 | 11 | 6 | 4 | 0 | 0 | 0 | 106.68 | 1 | 5 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 2018 | 2 | 28 | 4 | 0 | 0 | 0 | 60.00 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 211 | 2018 | 5 | 20 | 4 | 0 | 0 | 0 | 100.00 | 0 | 2 | 1 | 0 | 0 | 0 | 1 |
| 4 | 2 | 0 | 1 | 1 | 3 | 0 | 0 | 48 | 2018 | 4 | 11 | 4 | 0 | 0 | 0 | 94.50 | 0 | 2 | 1 | 0 | 1 | 0 | 0 |
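In the table above only the two binned columns were expanded, because the label-encoded columns are already integers by this point. The sketch below shows on a toy frame what `drop_first=True` does: the first category becomes the baseline and gets no column of its own. The `dtype=int` argument is an addition for this example, since recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy frame with the same category order as room_cost_cat.
toy = pd.DataFrame(
    {
        "room_cost_cat": pd.Categorical(
            ["Standard", "Moderate", "Premium", "Moderate"],
            categories=["Standard", "Moderate", "Premium"],
        )
    }
)

# drop_first=True removes the first category ("Standard"); a row of all
# zeros therefore means "Standard".
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```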
# splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
# check shape of the train and test data
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 25392
Number of rows in test data = 10883
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set:
0    0.670644
1    0.329356
Name: booking_status, dtype: float64
Percentage of classes in test set:
0    0.676376
1    0.323624
Name: booking_status, dtype: float64
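The class shares in the two sets are close but not identical. If matching shares are wanted, `train_test_split` accepts a `stratify` argument; the sketch below shows the call on synthetic data (`X_toy`, `y_toy`), not on the booking data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic target with roughly one third positives, standing in for booking_status.
y_toy = pd.Series((rng.random(1000) < 0.33).astype(int), name="booking_status")
X_toy = pd.DataFrame({"lead_time": rng.integers(0, 444, size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

# With stratify, both parts keep almost exactly the overall positive rate.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```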
# To measure collinearity among the predictor variables of a multiple regression
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
# Check VIF (Variance Inflation Factor) in the Training data.
checking_vif(X_train)
| | feature | VIF |
|---|---|---|
| 0 | no_of_adults | 18.036707 |
| 1 | no_of_children | 1.365725 |
| 2 | no_of_weekend_nights | inf |
| 3 | no_of_week_nights | inf |
| 4 | type_of_meal_plan | 1.423465 |
| 5 | required_car_parking_space | 1.070450 |
| 6 | room_type_reserved | 1.979669 |
| 7 | lead_time | 12.238806 |
| 8 | arrival_year | 61.346886 |
| 9 | arrival_month | 7.222832 |
| 10 | arrival_date | 4.216200 |
| 11 | market_segment_type | 44.957359 |
| 12 | repeated_guest | 1.719609 |
| 13 | no_of_previous_cancellations | 1.391234 |
| 14 | no_of_previous_bookings_not_canceled | 1.651631 |
| 15 | avg_price_per_room | 39.726701 |
| 16 | no_of_special_requests | 1.935322 |
| 17 | total_stay | inf |
| 18 | room_cost_cat_Moderate | 5.097389 |
| 19 | room_cost_cat_Premium | 7.549074 |
| 20 | lead_time_bins_Moderate | 2.298891 |
| 21 | lead_time_bins_High | 3.583085 |
| 22 | lead_time_bins_Extreme | 10.840082 |
# VIF in the Training data.
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

no_of_adults                            18.036707
no_of_children                           1.365725
no_of_weekend_nights                          inf
no_of_week_nights                             inf
type_of_meal_plan                        1.423465
required_car_parking_space               1.070450
room_type_reserved                       1.979669
lead_time                               12.238806
arrival_year                            61.346886
arrival_month                            7.222832
arrival_date                             4.216200
market_segment_type                     44.957359
repeated_guest                           1.719609
no_of_previous_cancellations             1.391234
no_of_previous_bookings_not_canceled     1.651631
avg_price_per_room                      39.726701
no_of_special_requests                   1.935322
total_stay                                    inf
room_cost_cat_Moderate                   5.097389
room_cost_cat_Premium                    7.549074
lead_time_bins_Moderate                  2.298891
lead_time_bins_High                      3.583085
lead_time_bins_Extreme                  10.840082
dtype: float64
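The three infinite values belong to `no_of_weekend_nights`, `no_of_week_nights` and `total_stay`, and `total_stay` was defined as the sum of the other two. A column that is an exact linear combination of others has R² = 1 when regressed on them, so VIF = 1 / (1 − R²) is unbounded. The sketch below reproduces this on synthetic data with a hypothetical helper, `vif_with_intercept`. Unlike the notebook's call, which passes the predictors without a constant column, this helper adds an intercept, so its numbers are not expected to match the ones printed above.

```python
import numpy as np
import pandas as pd


def vif_with_intercept(frame: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    values = frame.to_numpy(dtype=float)
    n_rows = values.shape[0]
    result = {}
    for j, name in enumerate(frame.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        # 1 / (1 - R^2) equals ss_tot / ss_res; no residual means exact dependence.
        result[name] = np.inf if ss_res <= 1e-10 * ss_tot else ss_tot / ss_res
    return pd.Series(result, name="VIF")


rng = np.random.default_rng(0)
week = rng.integers(0, 6, size=200)
weekend = rng.integers(0, 3, size=200)
toy = pd.DataFrame(
    {
        "no_of_week_nights": week,
        "no_of_weekend_nights": weekend,
        "total_stay": week + weekend,  # exact sum of the two columns above
        "lead_time": rng.integers(0, 300, size=200),
    }
)

vif_all = vif_with_intercept(toy)
vif_reduced = vif_with_intercept(toy.drop(columns="total_stay"))
print(vif_all)
print(vif_reduced)
```

Dropping any one of the three dependent columns removes the exact dependence, which is why the remaining two get finite values.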
# Removing no_of_weekend_nights in the Training data.
X_train1 = X_train.drop("no_of_weekend_nights", axis=1)
vif_series2 = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series2))
Series before feature selection: 

no_of_adults                            18.036707
no_of_children                           1.365725
no_of_week_nights                       15.093537
type_of_meal_plan                        1.423465
required_car_parking_space               1.070450
room_type_reserved                       1.979669
lead_time                               12.238806
arrival_year                            61.346886
arrival_month                            7.222832
arrival_date                             4.216200
market_segment_type                     44.957359
repeated_guest                           1.719609
no_of_previous_cancellations             1.391234
no_of_previous_bookings_not_canceled     1.651631
avg_price_per_room                      39.726701
no_of_special_requests                   1.935322
total_stay                              17.271696
room_cost_cat_Moderate                   5.097389
room_cost_cat_Premium                    7.549074
lead_time_bins_Moderate                  2.298891
lead_time_bins_High                      3.583085
lead_time_bins_Extreme                  10.840082
dtype: float64
# Removing arrival_year in the Training data.
X_train2 = X_train1.drop("arrival_year", axis=1)
vif_series3 = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series3))
Series before feature selection: 

no_of_adults                            16.546414
no_of_children                           1.363514
no_of_week_nights                       15.047937
type_of_meal_plan                        1.407918
required_car_parking_space               1.070322
room_type_reserved                       1.944986
lead_time                               12.228958
arrival_month                            6.512927
arrival_date                             3.974780
market_segment_type                     28.792905
repeated_guest                           1.582089
no_of_previous_cancellations             1.385231
no_of_previous_bookings_not_canceled     1.644854
avg_price_per_room                      34.927774
no_of_special_requests                   1.873345
total_stay                              17.271693
room_cost_cat_Moderate                   4.969457
room_cost_cat_Premium                    6.742555
lead_time_bins_Moderate                  2.297213
lead_time_bins_High                      3.582302
lead_time_bins_Extreme                  10.834659
dtype: float64
# Removing avg_price_per_room in the Training data.
X_train3 = X_train2.drop("avg_price_per_room", axis=1)
vif_series4 = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series4))
Series before feature selection: 

no_of_adults                            16.031679
no_of_children                           1.301128
no_of_week_nights                       15.033036
type_of_meal_plan                        1.406625
required_car_parking_space               1.068198
room_type_reserved                       1.907824
lead_time                               12.152134
arrival_month                            6.316914
arrival_date                             3.937829
market_segment_type                     23.535193
repeated_guest                           1.571968
no_of_previous_cancellations             1.384581
no_of_previous_bookings_not_canceled     1.644349
no_of_special_requests                   1.841793
total_stay                              17.259769
room_cost_cat_Moderate                   3.304519
room_cost_cat_Premium                    2.908484
lead_time_bins_Moderate                  2.296613
lead_time_bins_High                      3.578358
lead_time_bins_Extreme                  10.791826
dtype: float64
# Removing market_segment_type in the Training data.
X_train4 = X_train3.drop("market_segment_type", axis=1)
vif_series5 = pd.Series(
[variance_inflation_factor(X_train4.values, i) for i in range(X_train4.shape[1])],
index=X_train4.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series5))
Series before feature selection: no_of_adults 11.611740 no_of_children 1.278557 no_of_week_nights 15.032169 type_of_meal_plan 1.331180 required_car_parking_space 1.067521 room_type_reserved 1.896394 lead_time 12.134421 arrival_month 5.840383 arrival_date 3.713740 repeated_guest 1.570704 no_of_previous_cancellations 1.381205 no_of_previous_bookings_not_canceled 1.643495 no_of_special_requests 1.803720 total_stay 17.035549 room_cost_cat_Moderate 2.986281 room_cost_cat_Premium 2.713405 lead_time_bins_Moderate 2.234508 lead_time_bins_High 3.510377 lead_time_bins_Extreme 10.736847 dtype: float64
# Removing total_stay in the Training data.
X_train5 = X_train4.drop("total_stay", axis=1)
vif_series6 = pd.Series(
[variance_inflation_factor(X_train5.values, i) for i in range(X_train5.shape[1])],
index=X_train5.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series6))
Series before feature selection: no_of_adults 11.377685 no_of_children 1.277382 no_of_week_nights 3.496830 type_of_meal_plan 1.331010 required_car_parking_space 1.067192 room_type_reserved 1.895948 lead_time 12.107277 arrival_month 5.839443 arrival_date 3.702185 repeated_guest 1.570162 no_of_previous_cancellations 1.381201 no_of_previous_bookings_not_canceled 1.643197 no_of_special_requests 1.801129 room_cost_cat_Moderate 2.985887 room_cost_cat_Premium 2.707390 lead_time_bins_Moderate 2.228643 lead_time_bins_High 3.487384 lead_time_bins_Extreme 10.691622 dtype: float64
# Removing lead_time in the Training data.
X_train6 = X_train5.drop("lead_time", axis=1)
vif_series7 = pd.Series(
[variance_inflation_factor(X_train6.values, i) for i in range(X_train6.shape[1])],
index=X_train6.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series7))
Series before feature selection: no_of_adults 11.351718 no_of_children 1.277370 no_of_week_nights 3.495378 type_of_meal_plan 1.330344 required_car_parking_space 1.066836 room_type_reserved 1.892500 arrival_month 5.774710 arrival_date 3.700900 repeated_guest 1.569468 no_of_previous_cancellations 1.381198 no_of_previous_bookings_not_canceled 1.643187 no_of_special_requests 1.796855 room_cost_cat_Moderate 2.984662 room_cost_cat_Premium 2.691579 lead_time_bins_Moderate 2.054403 lead_time_bins_High 2.142817 lead_time_bins_Extreme 2.305310 dtype: float64
# Removing no_of_adults in the Training data.
X_train7 = X_train6.drop("no_of_adults", axis=1)
vif_series8 = pd.Series(
[variance_inflation_factor(X_train7.values, i) for i in range(X_train7.shape[1])],
index=X_train7.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series8))
Series before feature selection: no_of_children 1.250563 no_of_week_nights 3.319090 type_of_meal_plan 1.291326 required_car_parking_space 1.064421 room_type_reserved 1.827343 arrival_month 5.163814 arrival_date 3.382963 repeated_guest 1.568705 no_of_previous_cancellations 1.378597 no_of_previous_bookings_not_canceled 1.642505 no_of_special_requests 1.762772 room_cost_cat_Moderate 2.749816 room_cost_cat_Premium 2.404975 lead_time_bins_Moderate 1.887253 lead_time_bins_High 1.944412 lead_time_bins_Extreme 2.129918 dtype: float64
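The manual steps above all follow one rule: recompute the VIFs, drop the column with the largest value, and stop once every remaining value is below a cutoff. The following is a minimal sketch of that loop using NumPy only. The helper names `vif_scores` and `drop_high_vif`, the cutoff of 10, and the synthetic columns are assumptions for illustration and do not come from the notebook. This version adds an intercept to each auxiliary regression, so its numbers will generally differ from the `variance_inflation_factor` values printed above, which were computed on a design matrix without a constant column.

```python
import numpy as np


def vif_scores(X):
    """Return the VIF of every column: 1 / (1 - R^2) from regressing it on the rest."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for i in range(n_cols):
        y = X[:, i]
        # auxiliary regression: column i on an intercept plus all other columns
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        # cap the denominator so exact collinearity gives a huge finite value
        scores[i] = 1.0 / max(1.0 - r2, 1e-12)
    return scores


def drop_high_vif(X, names, cutoff=10.0):
    """Repeatedly drop the column with the largest VIF until all are <= cutoff."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    dropped = []
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= cutoff:
            break
        dropped.append(names.pop(worst))
        X = np.delete(X, worst, axis=1)
    return names, dropped, X


# synthetic stand-ins (not the booking data): total_stay is the exact sum
# of the two night counts, which mirrors the `inf` VIFs seen in the notebook
rng = np.random.default_rng(0)
weekend = rng.normal(size=500)
week = rng.normal(size=500)
lead = rng.normal(size=500)
X_demo = np.column_stack([weekend, week, weekend + week, lead])
kept, dropped, X_kept = drop_high_vif(
    X_demo, ["weekend_nights", "week_nights", "total_stay", "lead_time"]
)
print("dropped:", dropped)
print("kept:", kept)
```

Only one of the three mutually dependent columns needs to go before the rest fall under the cutoff, which is the same effect seen above when a single drop removed the `inf` values.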
# Check VIF (Variance Inflation Factor) in the test data
checking_vif(X_test)
| | feature | VIF |
|---|---|---|
| 0 | no_of_adults | 17.980231 |
| 1 | no_of_children | 1.376285 |
| 2 | no_of_weekend_nights | inf |
| 3 | no_of_week_nights | inf |
| 4 | type_of_meal_plan | 1.433864 |
| 5 | required_car_parking_space | 1.056689 |
| 6 | room_type_reserved | 1.936043 |
| 7 | lead_time | 12.552430 |
| 8 | arrival_year | 62.051980 |
| 9 | arrival_month | 7.214151 |
| 10 | arrival_date | 4.199687 |
| 11 | market_segment_type | 44.121074 |
| 12 | repeated_guest | 1.695691 |
| 13 | no_of_previous_cancellations | 1.257421 |
| 14 | no_of_previous_bookings_not_canceled | 1.532717 |
| 15 | avg_price_per_room | 42.524806 |
| 16 | no_of_special_requests | 1.954332 |
| 17 | total_stay | inf |
| 18 | room_cost_cat_Moderate | 5.342367 |
| 19 | room_cost_cat_Premium | 7.703122 |
| 20 | lead_time_bins_Moderate | 2.275950 |
| 21 | lead_time_bins_High | 3.552319 |
| 22 | lead_time_bins_Extreme | 11.061391 |
# Removing no_of_weekend_nights in the test data
X_test1 = X_test.drop("no_of_weekend_nights", axis=1)
vif_series2 = pd.Series(
[variance_inflation_factor(X_test1.values, i) for i in range(X_test1.shape[1])],
index=X_test1.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series2))
Series before feature selection: no_of_adults 17.980231 no_of_children 1.376285 no_of_week_nights 15.190752 type_of_meal_plan 1.433864 required_car_parking_space 1.056689 room_type_reserved 1.936043 lead_time 12.552430 arrival_year 62.051980 arrival_month 7.214151 arrival_date 4.199687 market_segment_type 44.121074 repeated_guest 1.695691 no_of_previous_cancellations 1.257421 no_of_previous_bookings_not_canceled 1.532717 avg_price_per_room 42.524806 no_of_special_requests 1.954332 total_stay 17.317509 room_cost_cat_Moderate 5.342367 room_cost_cat_Premium 7.703122 lead_time_bins_Moderate 2.275950 lead_time_bins_High 3.552319 lead_time_bins_Extreme 11.061391 dtype: float64
# Removing arrival_year in the test data
X_test2 = X_test1.drop("arrival_year", axis=1)
vif_series3 = pd.Series(
[variance_inflation_factor(X_test2.values, i) for i in range(X_test2.shape[1])],
index=X_test2.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series3))
Series before feature selection: no_of_adults 16.437781 no_of_children 1.373334 no_of_week_nights 15.169232 type_of_meal_plan 1.419950 required_car_parking_space 1.056118 room_type_reserved 1.898490 lead_time 12.542572 arrival_month 6.555446 arrival_date 3.945641 market_segment_type 29.361799 repeated_guest 1.563292 no_of_previous_cancellations 1.253350 no_of_previous_bookings_not_canceled 1.526041 avg_price_per_room 36.593231 no_of_special_requests 1.886264 total_stay 17.312494 room_cost_cat_Moderate 5.203804 room_cost_cat_Premium 6.814725 lead_time_bins_Moderate 2.273085 lead_time_bins_High 3.551212 lead_time_bins_Extreme 11.056917 dtype: float64
# Removing avg_price_per_room in the test data
X_test3 = X_test2.drop("avg_price_per_room", axis=1)
vif_series4 = pd.Series(
[variance_inflation_factor(X_test3.values, i) for i in range(X_test3.shape[1])],
index=X_test3.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series4))
Series before feature selection: no_of_adults 15.988734 no_of_children 1.308068 no_of_week_nights 15.138783 type_of_meal_plan 1.416953 required_car_parking_space 1.055243 room_type_reserved 1.871664 lead_time 12.465557 arrival_month 6.344169 arrival_date 3.920301 market_segment_type 23.398933 repeated_guest 1.550647 no_of_previous_cancellations 1.252974 no_of_previous_bookings_not_canceled 1.525864 no_of_special_requests 1.853749 total_stay 17.294375 room_cost_cat_Moderate 3.353754 room_cost_cat_Premium 2.817079 lead_time_bins_Moderate 2.272895 lead_time_bins_High 3.545823 lead_time_bins_Extreme 11.012526 dtype: float64
# Removing market_segment_type in the test data
X_test4 = X_test3.drop("market_segment_type", axis=1)
vif_series5 = pd.Series(
[variance_inflation_factor(X_test4.values, i) for i in range(X_test4.shape[1])],
index=X_test4.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series5))
Series before feature selection: no_of_adults 11.564597 no_of_children 1.291275 no_of_week_nights 15.137060 type_of_meal_plan 1.334150 required_car_parking_space 1.054691 room_type_reserved 1.861873 lead_time 12.445560 arrival_month 5.889908 arrival_date 3.703538 repeated_guest 1.549773 no_of_previous_cancellations 1.249576 no_of_previous_bookings_not_canceled 1.525022 no_of_special_requests 1.815064 total_stay 16.999694 room_cost_cat_Moderate 3.012464 room_cost_cat_Premium 2.624507 lead_time_bins_Moderate 2.218483 lead_time_bins_High 3.480781 lead_time_bins_Extreme 10.954338 dtype: float64
# Removing total_stay in the test data
X_test5 = X_test4.drop("total_stay", axis=1)
vif_series6 = pd.Series(
[variance_inflation_factor(X_test5.values, i) for i in range(X_test5.shape[1])],
index=X_test5.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series6))
Series before feature selection: no_of_adults 11.362841 no_of_children 1.289859 no_of_week_nights 3.502134 type_of_meal_plan 1.334136 required_car_parking_space 1.054303 room_type_reserved 1.861234 lead_time 12.424250 arrival_month 5.885336 arrival_date 3.692779 repeated_guest 1.549446 no_of_previous_cancellations 1.249514 no_of_previous_bookings_not_canceled 1.524926 no_of_special_requests 1.809403 room_cost_cat_Moderate 3.010511 room_cost_cat_Premium 2.611177 lead_time_bins_Moderate 2.212646 lead_time_bins_High 3.460713 lead_time_bins_Extreme 10.920874 dtype: float64
# Removing lead_time in the test data
X_test6 = X_test5.drop("lead_time", axis=1)
vif_series7 = pd.Series(
[variance_inflation_factor(X_test6.values, i) for i in range(X_test6.shape[1])],
index=X_test6.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series7))
Series before feature selection: no_of_adults 11.337906 no_of_children 1.289851 no_of_week_nights 3.500280 type_of_meal_plan 1.333235 required_car_parking_space 1.053845 room_type_reserved 1.859141 arrival_month 5.821853 arrival_date 3.690839 repeated_guest 1.548699 no_of_previous_cancellations 1.249513 no_of_previous_bookings_not_canceled 1.524916 no_of_special_requests 1.802884 room_cost_cat_Moderate 3.009809 room_cost_cat_Premium 2.595890 lead_time_bins_Moderate 2.030791 lead_time_bins_High 2.089319 lead_time_bins_Extreme 2.266146 dtype: float64
# Removing no_of_adults in the test data
X_test7 = X_test6.drop("no_of_adults", axis=1)
vif_series8 = pd.Series(
[variance_inflation_factor(X_test7.values, i) for i in range(X_test7.shape[1])],
index=X_test7.columns,
)
print("Series before feature selection: \n\n{}\n".format(vif_series8))
Series before feature selection: no_of_children 1.275147 no_of_week_nights 3.317793 type_of_meal_plan 1.292619 required_car_parking_space 1.051838 room_type_reserved 1.796498 arrival_month 5.182582 arrival_date 3.358572 repeated_guest 1.548153 no_of_previous_cancellations 1.248617 no_of_previous_bookings_not_canceled 1.524520 no_of_special_requests 1.762238 room_cost_cat_Moderate 2.789726 room_cost_cat_Premium 2.349664 lead_time_bins_Moderate 1.873728 lead_time_bins_High 1.903979 lead_time_bins_Extreme 2.098776 dtype: float64
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
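The four scores returned by the helper are all simple ratios of the confusion-matrix cells. As a small worked example in plain Python, the hypothetical function `metrics_from_counts` below derives them from made-up counts; neither the function nor the numbers come from the booking data.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix cell counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # recall: share of actual positives that were caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # precision: share of predicted positives that were correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# made-up counts for illustration only
demo = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```

In this problem a false negative (`fn`) is a booking predicted as kept that was actually canceled, so recall is the score that reacts most directly to missed cancellations.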
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train7, y_train)
# Checking model performance on training set
# predicting on training set
y_pred_train = lg.predict(X_train7)
print("Training set performance:")
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print("Precision:", precision_score(y_train, y_pred_train))
print("Recall:", recall_score(y_train, y_pred_train))
print("F1:", f1_score(y_train, y_pred_train))
Training set performance: Accuracy: 0.7666194076874606 Precision: 0.6817842756974489 Recall: 0.5464546215472916 F1: 0.6066640116819328
# Checking performance on test set
y_pred_test = lg.predict(X_test7)
print("Test set performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
print("F1:", f1_score(y_test, y_pred_test))
Test set performance: Accuracy: 0.7746944776256547 Precision: 0.6907988587731811 Recall: 0.5499716070414538 F1: 0.6123932975023713
We have built a logistic regression model whose performance is similar on the train and test sets (accuracy about 0.77, recall about 0.55 on both), but to identify the significant variables we will have to build a logistic regression model using the statsmodels library.
We will now perform logistic regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.
Using statsmodels, we will be able to check the statistical validity of our model - identify the significant predictors from p-values that we get for each predictor variable.
# fitting logistic regression model
logit = sm.Logit(y_train, X_train7.astype(float))
lg = logit.fit(disp=False)
# setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25376
Method: MLE Df Model: 15
Date: Sat, 20 Nov 2021 Pseudo R-squ.: 0.2094
Time: 03:41:57 Log-Likelihood: -12723.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children 0.2406 0.040 5.993 0.000 0.162 0.319
no_of_week_nights -0.0506 0.011 -4.700 0.000 -0.072 -0.029
type_of_meal_plan 0.1575 0.015 10.499 0.000 0.128 0.187
required_car_parking_space -1.4314 0.132 -10.882 0.000 -1.689 -1.174
room_type_reserved 0.0334 0.013 2.478 0.013 0.007 0.060
arrival_month -0.1378 0.005 -29.434 0.000 -0.147 -0.129
arrival_date -0.0234 0.002 -14.653 0.000 -0.027 -0.020
repeated_guest -2.4280 0.490 -4.952 0.000 -3.389 -1.467
no_of_previous_cancellations 0.2001 0.088 2.281 0.023 0.028 0.372
no_of_previous_bookings_not_canceled -0.3057 0.206 -1.480 0.139 -0.710 0.099
no_of_special_requests -1.0625 0.025 -42.233 0.000 -1.112 -1.013
room_cost_cat_Moderate 0.2399 0.036 6.591 0.000 0.169 0.311
room_cost_cat_Premium 1.0492 0.049 21.300 0.000 0.953 1.146
lead_time_bins_Moderate 0.1800 0.045 4.010 0.000 0.092 0.268
lead_time_bins_High 0.7892 0.043 18.158 0.000 0.704 0.874
lead_time_bins_Extreme 2.4119 0.048 50.270 0.000 2.318 2.506
========================================================================================================
X_train7 = X_train7.drop(["no_of_previous_bookings_not_canceled"], axis=1)
logit = sm.Logit(y_train, X_train7.astype(float))
lg = logit.fit(disp=False)
# setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25377
Method: MLE Df Model: 14
Date: Sat, 20 Nov 2021 Pseudo R-squ.: 0.2092
Time: 03:41:57 Log-Likelihood: -12725.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
no_of_children 0.2407 0.040 5.997 0.000 0.162 0.319
no_of_week_nights -0.0507 0.011 -4.707 0.000 -0.072 -0.030
type_of_meal_plan 0.1575 0.015 10.497 0.000 0.128 0.187
required_car_parking_space -1.4309 0.132 -10.879 0.000 -1.689 -1.173
room_type_reserved 0.0333 0.013 2.468 0.014 0.007 0.060
arrival_month -0.1378 0.005 -29.428 0.000 -0.147 -0.129
arrival_date -0.0234 0.002 -14.686 0.000 -0.027 -0.020
repeated_guest -3.0714 0.422 -7.278 0.000 -3.898 -2.244
no_of_previous_cancellations 0.1589 0.073 2.181 0.029 0.016 0.302
no_of_special_requests -1.0632 0.025 -42.261 0.000 -1.113 -1.014
room_cost_cat_Moderate 0.2404 0.036 6.604 0.000 0.169 0.312
room_cost_cat_Premium 1.0502 0.049 21.317 0.000 0.954 1.147
lead_time_bins_Moderate 0.1810 0.045 4.034 0.000 0.093 0.269
lead_time_bins_High 0.7900 0.043 18.176 0.000 0.705 0.875
lead_time_bins_Extreme 2.4136 0.048 50.303 0.000 2.320 2.508
================================================================================================
print("Training performance:")
model_performance_classification_statsmodels(lg, X_train7, y_train)
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.75571 | 0.54777 | 0.654242 | 0.59629 |
# running a loop to drop variables with high p-value
# initial list of columns
cols = X_train7.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    X_train_aux = X_train7[cols]
    # fitting the model
    model = sm.Logit(y_train, X_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['no_of_children', 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_special_requests', 'room_cost_cat_Moderate', 'room_cost_cat_Premium', 'lead_time_bins_Moderate', 'lead_time_bins_High', 'lead_time_bins_Extreme']
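The loop above is a backward elimination: fit, find the predictor with the largest p-value, drop it if that value exceeds 0.05, and refit. A minimal sketch of the same control flow as a reusable function follows. The names `backward_eliminate`, `lookup_pvalues`, and `FIRST_FIT_PVALUES` are hypothetical, and the callback here only looks up fixed p-values copied from the first summary above instead of refitting a model, so it illustrates the loop logic and nothing more.

```python
def backward_eliminate(columns, pvalues_for, alpha=0.05):
    """Drop the least significant column and re-evaluate until all p-values <= alpha."""
    cols = list(columns)
    while cols:
        pvals = pvalues_for(cols)  # mapping: column name -> p-value
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:
            break
        cols.remove(worst)
    return cols


# Stand-in for refitting: a real callback would fit a model on `cols` and
# return its p-values. These four values are copied from the first summary.
FIRST_FIT_PVALUES = {
    "room_type_reserved": 0.013,
    "no_of_previous_cancellations": 0.023,
    "no_of_previous_bookings_not_canceled": 0.139,
    "no_of_special_requests": 0.000,
}


def lookup_pvalues(cols):
    """Return the stored p-value for each requested column."""
    return {c: FIRST_FIT_PVALUES[c] for c in cols}


selected = backward_eliminate(FIRST_FIT_PVALUES, lookup_pvalues)
print(selected)
```

With statsmodels, the callback could plausibly be something like `lambda cols: sm.Logit(y_train, X_train7[cols]).fit(disp=False).pvalues.to_dict()`, so that p-values are re-estimated after every drop as in the loop above.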
X_train8 = X_train7[selected_features]
logit1 = sm.Logit(y_train, X_train8.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25377
Method: MLE Df Model: 14
Date: Sat, 20 Nov 2021 Pseudo R-squ.: 0.2092
Time: 03:41:58 Log-Likelihood: -12725.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
no_of_children 0.2407 0.040 5.997 0.000 0.162 0.319
no_of_week_nights -0.0507 0.011 -4.707 0.000 -0.072 -0.030
type_of_meal_plan 0.1575 0.015 10.497 0.000 0.128 0.187
required_car_parking_space -1.4309 0.132 -10.879 0.000 -1.689 -1.173
room_type_reserved 0.0333 0.013 2.468 0.014 0.007 0.060
arrival_month -0.1378 0.005 -29.428 0.000 -0.147 -0.129
arrival_date -0.0234 0.002 -14.686 0.000 -0.027 -0.020
repeated_guest -3.0714 0.422 -7.278 0.000 -3.898 -2.244
no_of_previous_cancellations 0.1589 0.073 2.181 0.029 0.016 0.302
no_of_special_requests -1.0632 0.025 -42.261 0.000 -1.113 -1.014
room_cost_cat_Moderate 0.2404 0.036 6.604 0.000 0.169 0.312
room_cost_cat_Premium 1.0502 0.049 21.317 0.000 0.954 1.147
lead_time_bins_Moderate 0.1810 0.045 4.034 0.000 0.093 0.269
lead_time_bins_High 0.7900 0.043 18.176 0.000 0.705 0.875
lead_time_bins_Extreme 2.4136 0.048 50.303 0.000 2.320 2.508
================================================================================================
No feature now has a p-value greater than 0.05, so we'll take the features in X_train8 as the final set and lg1 as the final model.
# converting coefficients to odds ratios
odds = np.exp(lg1.params)
# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe (indexed by the columns the model was fitted on)
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train8.columns).T
| | no_of_children | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_special_requests | room_cost_cat_Moderate | room_cost_cat_Premium | lead_time_bins_Moderate | lead_time_bins_High | lead_time_bins_Extreme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 1.272191 | 0.950593 | 1.170548 | 0.239098 | 1.033820 | 0.871273 | 0.976841 | 0.046356 | 1.172254 | 0.345344 | 1.271731 | 2.858081 | 1.198431 | 2.203402 | 11.174292 |
| Change_odd% | 27.219127 | -4.940695 | 17.054849 | -76.090225 | 3.382025 | -12.872746 | -2.315924 | -95.364370 | 17.225407 | -65.465632 | 27.173061 | 185.808096 | 19.843086 | 120.340186 | 1017.429209 |
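To read these numbers: in a logit model the log-odds of cancellation are a linear function of the predictors, so a one-unit increase in a predictor multiplies the odds by exp(coefficient), holding the other predictors fixed. The short sketch below redoes that arithmetic with the standard library for four coefficients copied from the `lg1` summary; the dictionary names are chosen for illustration, and small differences from the table come from the summary rounding coefficients to four decimals.

```python
import math

# coefficients copied from the lg1 summary (rounded to four decimals there)
coefs = {
    "no_of_special_requests": -1.0632,
    "repeated_guest": -3.0714,
    "room_cost_cat_Premium": 1.0502,
    "lead_time_bins_Extreme": 2.4136,
}

# odds ratio: factor by which the odds of cancellation change per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
# percentage change in the odds implied by that factor
pct_changes = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(
        f"{name}: odds ratio {odds_ratios[name]:.3f}, "
        f"change in odds {pct_changes[name]:+.1f}%"
    )
```

For example, each additional special request multiplies the odds of cancellation by roughly 0.35 (about a 65% decrease), while a booking in the Extreme lead-time bin has odds roughly 11 times those of the baseline bin.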
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train8, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg1, X_train8, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.75571 | 0.54777 | 0.654242 | 0.59629 |
# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train8))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train8))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train8))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3670703502736609
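The rule `np.argmax(tpr - fpr)` picks the threshold that maximizes Youden's J statistic, J = TPR - FPR. The sketch below recomputes that quantity by brute force in plain Python so the definition is visible. The function name `youden_threshold` and the labels and scores are made up for illustration, both classes are assumed to be present, and an observation is counted as positive when its score is greater than or equal to the threshold, which is the convention `roc_curve` uses.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR, predicting 1 when score >= threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    # every distinct score is a candidate cut-off
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# made-up labels and predicted probabilities
y_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
s_demo = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]
best_t, best_j = youden_threshold(y_demo, s_demo)
print(f"threshold={best_t}, J={best_j:.2f}")
```

In the toy data the best cut-off is 0.60: it keeps four of the five positives and none of the negatives. Note that the notebook's helper functions classify with a strict `>` comparison, so a score exactly equal to the threshold is treated differently there.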
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, X_train8, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_train8, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.736767 | 0.731317 | 0.579551 | 0.646648 |
y_scores = lg1.predict(X_train8)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.48
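The value 0.48 is read off the plot as the point where the precision and recall curves meet. As an alternative to reading it by eye, the crossing point can be located numerically by scanning the candidate thresholds for the smallest gap between the two scores. The sketch below does this in plain Python; the function name `balanced_threshold` and the data are made up for illustration and are not the notebook's method.

```python
def balanced_threshold(y_true, scores):
    """Return (threshold, precision, recall) where |precision - recall| is smallest."""
    n_pos = sum(1 for y in y_true if y == 1)
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        # tp + fp >= 1 because t is itself one of the scores
        precision = tp / (tp + fp)
        recall = tp / n_pos
        gap = abs(precision - recall)
        if best is None or gap < best[0]:
            best = (gap, t, precision, recall)
    return best[1], best[2], best[3]


# made-up labels and predicted probabilities
y_demo = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
s_demo = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]
t_bal, p_bal, r_bal = balanced_threshold(y_demo, s_demo)
print(f"threshold={t_bal}, precision={p_bal:.2f}, recall={r_bal:.2f}")
```

With the arrays from `precision_recall_curve`, a comparable lookup would be `tre[np.argmin(np.abs(prec[:-1] - rec[:-1]))]`, since the last precision and recall entries have no matching threshold.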
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train8, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, X_train8, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.755868 | 0.582805 | 0.642669 | 0.611275 |
# training performance comparison
models_train_comp_data = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_data.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.48 Threshold",
]
print("Training performance comparison:")
models_train_comp_data
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.37 Threshold | Logistic Regression-0.48 Threshold |
|---|---|---|---|
| Accuracy | 0.755710 | 0.736767 | 0.755868 |
| Recall | 0.547770 | 0.731317 | 0.582805 |
| Precision | 0.654242 | 0.579551 | 0.642669 |
| F1 | 0.596290 | 0.646648 | 0.611275 |
# fitting logistic regression model
logit = sm.Logit(y_test, X_test7.astype(float))
lg = logit.fit(disp=False)
# setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 10883
Model: Logit Df Residuals: 10867
Method: MLE Df Model: 15
Date: Sat, 20 Nov 2021 Pseudo R-squ.: 0.2133
Time: 03:42:01 Log-Likelihood: -5390.2
converged: True LL-Null: -6851.6
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children 0.1476 0.066 2.253 0.024 0.019 0.276
no_of_week_nights -0.0464 0.016 -2.815 0.005 -0.079 -0.014
type_of_meal_plan 0.1163 0.023 5.093 0.000 0.072 0.161
required_car_parking_space -1.3538 0.202 -6.693 0.000 -1.750 -0.957
room_type_reserved 0.0487 0.021 2.353 0.019 0.008 0.089
arrival_month -0.1348 0.007 -18.771 0.000 -0.149 -0.121
arrival_date -0.0221 0.002 -9.119 0.000 -0.027 -0.017
repeated_guest -2.5176 0.592 -4.255 0.000 -3.677 -1.358
no_of_previous_cancellations 0.1417 0.106 1.337 0.181 -0.066 0.349
no_of_previous_bookings_not_canceled -0.0792 0.130 -0.610 0.542 -0.334 0.175
no_of_special_requests -1.1241 0.039 -28.498 0.000 -1.201 -1.047
room_cost_cat_Moderate 0.2218 0.056 3.980 0.000 0.113 0.331
room_cost_cat_Premium 0.9682 0.076 12.660 0.000 0.818 1.118
lead_time_bins_Moderate 0.2091 0.069 3.050 0.002 0.075 0.343
lead_time_bins_High 0.7545 0.067 11.314 0.000 0.624 0.885
lead_time_bins_Extreme 2.4118 0.073 33.051 0.000 2.269 2.555
========================================================================================================
print("Test performance:")
model_performance_classification_statsmodels(lg, X_test7, y_test)
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.756593 | 0.532368 | 0.651721 | 0.586029 |
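The cells above fit a separate model on the test split. A common alternative is to keep the coefficients estimated on the training split and only score the held-out rows with them, so that the test metrics describe how the trained model generalizes. The sketch below shows that scoring step with the standard library, using the coefficients copied from the training summary of `lg1` (which was fitted without an intercept). The function `cancel_probability` and the example booking are hypothetical, and the feature encodings are assumed to be numeric as in the notebook.

```python
import math

# coefficients copied from the training-set summary of lg1 (no intercept term)
TRAIN_PARAMS = {
    "no_of_children": 0.2407,
    "no_of_week_nights": -0.0507,
    "type_of_meal_plan": 0.1575,
    "required_car_parking_space": -1.4309,
    "room_type_reserved": 0.0333,
    "arrival_month": -0.1378,
    "arrival_date": -0.0234,
    "repeated_guest": -3.0714,
    "no_of_previous_cancellations": 0.1589,
    "no_of_special_requests": -1.0632,
    "room_cost_cat_Moderate": 0.2404,
    "room_cost_cat_Premium": 1.0502,
    "lead_time_bins_Moderate": 0.1810,
    "lead_time_bins_High": 0.7900,
    "lead_time_bins_Extreme": 2.4136,
}


def cancel_probability(row, params=TRAIN_PARAMS):
    """Logistic probability for one booking; features missing from `row` count as 0."""
    z = sum(beta * row.get(name, 0.0) for name, beta in params.items())
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical held-out booking; all values are made up
booking = {
    "no_of_week_nights": 2,
    "type_of_meal_plan": 1,
    "room_type_reserved": 1,
    "arrival_month": 10,
    "arrival_date": 15,
    "lead_time_bins_Extreme": 1,
}
p = cancel_probability(booking)
p_with_request = cancel_probability({**booking, "no_of_special_requests": 1})
print(f"without special request: {p:.3f}")
print(f"with one special request: {p_with_request:.3f}")
```

In the notebook itself the equivalent step would be to call the existing helper with the trained model and the test rows restricted to the training columns, for example `model_performance_classification_statsmodels(lg1, X_test7[X_train8.columns], y_test)`, assuming `lg1` still refers to the model fitted on the training data at that point.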
# running a loop to drop variables with high p-value
# initial list of columns
cols = X_test7.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the test set
    X_test_aux = X_test7[cols]
    # fitting the model
    model = sm.Logit(y_test, X_test_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['no_of_children', 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_special_requests', 'room_cost_cat_Moderate', 'room_cost_cat_Premium', 'lead_time_bins_Moderate', 'lead_time_bins_High', 'lead_time_bins_Extreme']
X_test8 = X_test7[selected_features]
logit1 = sm.Logit(y_test, X_test8.astype(float))
lg1 = logit1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 10883
Model: Logit Df Residuals: 10869
Method: MLE Df Model: 13
Date: Sat, 20 Nov 2021 Pseudo R-squ.: 0.2131
Time: 03:42:01 Log-Likelihood: -5391.3
converged: True LL-Null: -6851.6
Covariance Type: nonrobust LLR p-value: 0.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
no_of_children 0.1478 0.066 2.255 0.024 0.019 0.276
no_of_week_nights -0.0464 0.016 -2.814 0.005 -0.079 -0.014
type_of_meal_plan 0.1170 0.023 5.120 0.000 0.072 0.162
required_car_parking_space -1.3527 0.202 -6.686 0.000 -1.749 -0.956
room_type_reserved 0.0488 0.021 2.354 0.019 0.008 0.089
arrival_month -0.1350 0.007 -18.801 0.000 -0.149 -0.121
arrival_date -0.0222 0.002 -9.138 0.000 -0.027 -0.017
repeated_guest -2.4884 0.423 -5.884 0.000 -3.317 -1.660
no_of_special_requests -1.1245 0.039 -28.509 0.000 -1.202 -1.047
room_cost_cat_Moderate 0.2221 0.056 3.984 0.000 0.113 0.331
room_cost_cat_Premium 0.9690 0.076 12.670 0.000 0.819 1.119
lead_time_bins_Moderate 0.2103 0.069 3.068 0.002 0.076 0.345
lead_time_bins_High 0.7560 0.067 11.335 0.000 0.625 0.887
lead_time_bins_Extreme 2.4146 0.073 33.088 0.000 2.272 2.558
==============================================================================================
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test8, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg1, X_test8, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.756593 | 0.532368 | 0.651721 | 0.586029 |
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test8))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test8))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
Using model with threshold=0.37
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test8, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, X_test8, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.73794 | 0.719194 | 0.576206 | 0.639808 |
Using model with threshold = 0.58
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_test8, y_test, threshold=optimal_threshold_curve)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_test8, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.757879 | 0.568995 | 0.642102 | 0.603342 |
# training performance comparison
models_train_comp_data = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_data.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.76 Threshold",
    "Logistic Regression-0.58 Threshold",
]
print("Training performance comparison:")
models_train_comp_data
Training performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.76 Threshold | Logistic Regression-0.58 Threshold |
|---|---|---|---|
| Accuracy | 0.755710 | 0.736767 | 0.755868 |
| Recall | 0.547770 | 0.731317 | 0.582805 |
| Precision | 0.654242 | 0.579551 | 0.642669 |
| F1 | 0.596290 | 0.646648 | 0.611275 |
# testing performance comparison
models_test_comp_data = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_data.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.76 Threshold",
    "Logistic Regression-0.58 Threshold",
]
print("Test set performance comparison:")
models_test_comp_data
Test set performance comparison:
| | Logistic Regression sklearn | Logistic Regression-0.76 Threshold | Logistic Regression-0.58 Threshold |
|---|---|---|---|
| Accuracy | 0.756593 | 0.737940 | 0.757879 |
| Recall | 0.532368 | 0.719194 | 0.568995 |
| Precision | 0.651721 | 0.576206 | 0.642102 |
| F1 | 0.586029 | 0.639808 | 0.603342 |
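The columns in the comparison above come from the same fitted model and differ only in the cut-off applied to its predicted probabilities. As a minimal sketch of that step, the snippet below turns probabilities into class labels at a given threshold and reports recall and precision. It uses plain Python on made-up numbers: `metrics_at_threshold` and the toy data are hypothetical and not part of the notebook, and the strict `>` comparison is an assumption about how the notebook's helper applies the threshold.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Return (recall, precision) after labelling prob > threshold as class 1."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy data (illustrative only): 1 = canceled, 0 = not canceled
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1]

for thr in (0.35, 0.5, 0.65):
    rec, prec = metrics_at_threshold(y_true, y_prob, thr)
    print(f"threshold={thr:.2f}  recall={rec:.2f}  precision={prec:.2f}")
```

On this toy data, moving the threshold from 0.5 to 0.35 raises recall from 0.50 to 0.75 while precision drops from 0.67 to 0.60, which is the same kind of trade-off the comparison tables summarize.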
The model can make two kinds of wrong predictions:
Predicting that the guest did Not Cancel the booking when in reality the guest Canceled it (a false negative).
Predicting that the guest Canceled the booking when in reality the guest did Not Cancel it (a false positive).
Recall should be maximized: the higher the recall, the fewer false negatives the model produces.
## Function to calculate recall score
def get_recall_score(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    prediction = model.predict(predictors)
    return recall_score(target, prediction)
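To make the link between recall and false negatives concrete, the sketch below counts the four confusion-matrix cells by hand for a small made-up set of labels and derives recall as TP / (TP + FN). The encoding 1 = canceled is an assumption, and `confusion_counts` is a hypothetical helper that does not appear in the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Toy labels (illustrative only): 1 = canceled, 0 = not canceled
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)  # share of actual cancellations the model caught

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Recall:", recall)
```

Each false negative here is a cancellation the model missed, so reducing that count is what raises recall.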
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = get_recall_score(model, X_train, y_train)
print("Recall Score:", decision_tree_perf_train)
Recall Score: 0.9983259595838814
The model identifies almost every cancellation in the training set (recall of about 0.998), so it makes close to zero errors on the data it was trained on.
When no restrictions are applied, a decision tree keeps growing until it separates nearly every training point, because it learns all the patterns in the training set, including noise.
This generally leads to overfitting: the decision tree performs well on the training set but fails to replicate that performance on the test set.
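The overfitting behaviour described above can be reproduced on synthetic data. The sketch below is not the notebook's data or model: it uses scikit-learn's `make_classification` with deliberately noisy labels and compares an unrestricted tree with one limited to `max_depth=4` (an arbitrary illustrative value). The helper `fit_and_score` is hypothetical. The class weights mirror the ones used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with 10% flipped labels (illustrative only)
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    flip_y=0.1,
    weights=[0.67, 0.33],
    random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)


def fit_and_score(max_depth):
    """Fit a weighted tree and return (train recall, test recall, tree depth)."""
    clf = DecisionTreeClassifier(
        max_depth=max_depth, class_weight={0: 0.15, 1: 0.85}, random_state=1
    )
    clf.fit(X_tr, y_tr)
    train_recall = recall_score(y_tr, clf.predict(X_tr))
    test_recall = recall_score(y_te, clf.predict(X_te))
    return train_recall, test_recall, clf.get_depth()


full_train, full_test, full_depth = fit_and_score(None)  # no restrictions
lim_train, lim_test, lim_depth = fit_and_score(4)  # depth-limited tree

print(f"unrestricted (depth {full_depth}): train={full_train:.3f} test={full_test:.3f}")
print(f"max_depth=4  (depth {lim_depth}): train={lim_train:.3f} test={lim_test:.3f}")
```

The unrestricted tree memorizes the training labels, noise included, so its training recall is perfect while its test recall is lower. Limiting the depth usually narrows that gap, though the exact numbers depend on the data.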
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = get_recall_score(model, X_test, y_test)
print("Recall Score:", decision_tree_perf_test)
Recall Score: 0.7984099943214082
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
## Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 90.50 | |--- no_of_special_requests <= 1.50 | | |--- market_segment_type <= 3.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- market_segment_type <= 2.50 | | | | | |--- repeated_guest <= 0.50 | | | | | | |--- avg_price_per_room <= 52.50 | | | | | | | |--- weights: [18.15, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 52.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- lead_time <= 16.50 | | | | | | | | | |--- lead_time <= 11.50 | | | | | | | | | | |--- lead_time <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | | |--- lead_time > 9.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- lead_time > 11.50 | | | | | | | | | | |--- avg_price_per_room <= 97.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 97.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | |--- lead_time > 16.50 | | | | | | | | | |--- avg_price_per_room <= 135.00 | | | | | | | | | | |--- room_cost_cat_Premium <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | | |--- room_cost_cat_Premium > 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- avg_price_per_room > 135.00 | | | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 6.80] class: 1 | | | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- weights: [9.60, 0.00] class: 0 | | | | | |--- repeated_guest > 0.50 | | | | | | |--- weights: [46.65, 0.00] class: 0 | | | | |--- market_segment_type > 2.50 | | | | | |--- avg_price_per_room <= 199.01 | | | | | | |--- weights: [268.50, 0.00] class: 0 | | | | | |--- avg_price_per_room > 199.01 | | | | | | |--- lead_time_bins_Moderate <= 0.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- 
lead_time_bins_Moderate > 0.50 | | | | | | | |--- weights: [0.00, 13.60] class: 1 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- lead_time <= 68.50 | | | | | |--- arrival_month <= 9.50 | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | |--- avg_price_per_room <= 63.29 | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | |--- type_of_meal_plan <= 2.00 | | | | | | | | | | |--- lead_time_bins_High <= 0.50 | | | | | | | | | | | |--- weights: [7.50, 0.00] class: 0 | | | | | | | | | | |--- lead_time_bins_High > 0.50 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | |--- type_of_meal_plan > 2.00 | | | | | | | | | | |--- no_of_week_nights <= 1.00 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- no_of_week_nights > 1.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | |--- avg_price_per_room > 63.29 | | | | | | | | |--- no_of_weekend_nights <= 3.50 | | | | | | | | | |--- arrival_month <= 2.50 | | | | | | | | | | |--- total_stay <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- total_stay > 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 2.50 | | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- no_of_weekend_nights > 3.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [0.00, 8.50] class: 1 | | 
| | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | |--- avg_price_per_room <= 127.00 | | | | | | | | |--- weights: [35.10, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 127.00 | | | | | | | | |--- market_segment_type <= 2.50 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | | |--- market_segment_type > 2.50 | | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- arrival_month > 9.50 | | | | | | |--- market_segment_type <= 0.50 | | | | | | | |--- arrival_date <= 19.00 | | | | | | | | |--- total_stay <= 4.50 | | | | | | | | | |--- repeated_guest <= 0.50 | | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- repeated_guest > 0.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- total_stay > 4.50 | | | | | | | | | |--- total_stay <= 6.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- total_stay > 6.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- arrival_date > 19.00 | | | | | | | | |--- repeated_guest <= 0.50 | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | |--- repeated_guest > 0.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- market_segment_type > 0.50 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- no_of_week_nights <= 11.50 | | | | | | | | | |--- lead_time <= 0.50 | | | | | | | | | | |--- arrival_date <= 24.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- arrival_date > 24.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- lead_time > 0.50 | | | | | | | | | | |--- arrival_date <= 21.50 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | 
| | | | |--- arrival_date > 21.50 | | | | | | | | | | | |--- weights: [26.55, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 11.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | |--- room_type_reserved <= 1.00 | | | | | | | | | | |--- lead_time <= 66.50 | | | | | | | | | | | |--- weights: [0.75, 1.70] class: 1 | | | | | | | | | | |--- lead_time > 66.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- room_type_reserved > 1.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | |--- lead_time > 68.50 | | | | | |--- avg_price_per_room <= 99.98 | | | | | | |--- total_stay <= 3.50 | | | | | | | |--- arrival_date <= 25.50 | | | | | | | | |--- weights: [12.45, 0.00] class: 0 | | | | | | | |--- arrival_date > 25.50 | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- lead_time <= 76.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 76.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- total_stay > 3.50 | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | |--- avg_price_per_room <= 62.50 | | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 62.50 | | | | | | | | | | |--- total_stay <= 5.00 | | | | | | | | | | | |--- truncated branch of 
depth 5 | | | | | | | | | | |--- total_stay > 5.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | |--- lead_time <= 73.50 | | | | | | | | | | |--- total_stay <= 4.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- total_stay > 4.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 73.50 | | | | | | | | | | |--- lead_time <= 81.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 81.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | |--- weights: [6.15, 0.00] class: 0 | | | | | |--- avg_price_per_room > 99.98 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- avg_price_per_room <= 132.43 | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | |--- arrival_month <= 2.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 2.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | |--- total_stay <= 2.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- total_stay > 2.50 | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 132.43 | | | | | | | | |--- no_of_special_requests <= 0.50 | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | |--- no_of_special_requests > 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | |--- market_segment_type > 3.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- lead_time <= 3.50 | | | | | |--- 
avg_price_per_room <= 202.67 | | | | | | |--- arrival_month <= 8.50 | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | |--- weights: [10.05, 0.00] class: 0 | | | | | | | |--- arrival_month > 1.50 | | | | | | | | |--- total_stay <= 2.50 | | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | | |--- avg_price_per_room <= 77.50 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 77.50 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | | |--- room_cost_cat_Moderate <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | | |--- room_cost_cat_Moderate > 0.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- total_stay > 2.50 | | | | | | | | | |--- arrival_date <= 10.50 | | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | | |--- weights: [2.85, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 10.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | |--- arrival_month > 8.50 | | | | | | | |--- avg_price_per_room <= 169.67 | | | | | | | | |--- avg_price_per_room <= 94.66 | | | | | | | | | |--- weights: [11.70, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 94.66 | | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | | |--- arrival_date <= 28.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_date > 28.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | 
| | | | | | |--- truncated branch of depth 2 | | | | | | | |--- avg_price_per_room > 169.67 | | | | | | | | |--- avg_price_per_room <= 182.25 | | | | | | | | | |--- arrival_date <= 11.00 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 11.00 | | | | | | | | | | |--- arrival_date <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 24.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 182.25 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- avg_price_per_room > 202.67 | | | | | | |--- arrival_month <= 11.00 | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | |--- arrival_month > 11.00 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- lead_time > 3.50 | | | | | |--- arrival_year <= 2017.50 | | | | | | |--- lead_time <= 62.50 | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | |--- total_stay <= 2.50 | | | | | | | | | |--- avg_price_per_room <= 157.50 | | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | | |--- weights: [4.80, 0.00] class: 0 | | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 157.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- total_stay > 2.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- weights: [2.70, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- avg_price_per_room <= 212.33 | | | | | | | | | | | |--- truncated branch of depth 
11 | | | | | | | | | | |--- avg_price_per_room > 212.33 | | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | |--- arrival_month > 9.50 | | | | | | | | |--- avg_price_per_room <= 129.92 | | | | | | | | | |--- lead_time <= 26.50 | | | | | | | | | | |--- weights: [19.95, 0.00] class: 0 | | | | | | | | | |--- lead_time > 26.50 | | | | | | | | | | |--- arrival_date <= 6.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_date > 6.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | |--- avg_price_per_room > 129.92 | | | | | | | | | |--- avg_price_per_room <= 145.50 | | | | | | | | | | |--- lead_time <= 9.50 | | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 9.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 145.50 | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | |--- lead_time > 62.50 | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | |--- avg_price_per_room <= 47.63 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 47.63 | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 78.62 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- avg_price_per_room > 78.62 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- arrival_date > 27.50 | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | |--- arrival_year > 2017.50 | | | | | | |--- arrival_month <= 1.50 | | | | | | | |--- lead_time <= 24.50 | | | | | | | | |--- weights: [14.40, 0.00] class: 0 | | | | | | | |--- lead_time > 24.50 | | | | | | | | |--- avg_price_per_room <= 57.94 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 57.94 | | | | | | 
| | | |--- arrival_date <= 9.50 | | | | | | | | | | |--- room_cost_cat_Premium <= 0.50 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | | |--- room_cost_cat_Premium > 0.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- arrival_date > 9.50 | | | | | | | | | | |--- arrival_date <= 30.00 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- arrival_date > 30.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- arrival_month > 1.50 | | | | | | | |--- avg_price_per_room <= 61.67 | | | | | | | | |--- lead_time_bins_Moderate <= 0.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [2.40, 0.00] class: 0 | | | | | | | | |--- lead_time_bins_Moderate > 0.50 | | | | | | | | | |--- arrival_date <= 9.50 | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 9.50 | | | | | | | | | | |--- arrival_date <= 18.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- arrival_date > 18.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- avg_price_per_room > 61.67 | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- lead_time <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 16 | | | | | | | | | | |--- lead_time > 9.50 | | | | | | | | | | | |--- truncated branch of depth 30 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 24.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 24.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | |--- avg_price_per_room <= 204.50 | | | | | | | | | | |--- no_of_week_nights <= 0.50 
| | | | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- weights: [4.80, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 204.50 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_month <= 11.50 | | | | | |--- lead_time <= 6.50 | | | | | | |--- avg_price_per_room <= 157.64 | | | | | | | |--- no_of_week_nights <= 10.00 | | | | | | | | |--- lead_time <= 4.50 | | | | | | | | | |--- arrival_date <= 4.50 | | | | | | | | | | |--- weights: [11.85, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 4.50 | | | | | | | | | | |--- avg_price_per_room <= 142.10 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | | |--- avg_price_per_room > 142.10 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- lead_time > 4.50 | | | | | | | | | |--- arrival_date <= 13.50 | | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | | | |--- arrival_date > 13.50 | | | | | | | | | | |--- total_stay <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- total_stay > 1.50 | | | | | | | | | | | |--- weights: [11.40, 0.00] class: 0 | | | | | | | |--- no_of_week_nights > 10.00 | | | | | | | | |--- lead_time <= 4.50 | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- lead_time > 4.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 157.64 | | | | | | | |--- arrival_date <= 18.50 | | | | | | | | |--- arrival_date <= 10.00 | | | | | | | | | |--- arrival_month <= 10.50 | | | | | | | | | | |--- 
no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 10.50 | | | | | | | | | | |--- lead_time <= 4.00 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 4.00 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- arrival_date > 10.00 | | | | | | | | | |--- arrival_date <= 16.50 | | | | | | | | | | |--- room_type_reserved <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- room_type_reserved > 1.50 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- arrival_date > 16.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_date > 18.50 | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | |--- lead_time <= 1.50 | | | | | | | | | | |--- arrival_date <= 20.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 20.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- lead_time > 1.50 | | | | | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- type_of_meal_plan <= 0.50 | | | | | | | | | | |--- weights: [4.50, 0.00] class: 0 | | | | | | | | | |--- type_of_meal_plan > 0.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | |--- lead_time > 6.50 | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | |--- weights: [15.60, 0.00] class: 0 | | | | | | | |--- arrival_month > 1.50 | | | | | | | | |--- avg_price_per_room <= 121.78 | | | | | | | | | |--- total_stay <= 6.50 | | | | | | | | | | 
|--- avg_price_per_room <= 67.36 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- avg_price_per_room > 67.36 | | | | | | | | | | | |--- truncated branch of depth 22 | | | | | | | | | |--- total_stay > 6.50 | | | | | | | | | | |--- arrival_date <= 29.50 | | | | | | | | | | | |--- truncated branch of depth 17 | | | | | | | | | | |--- arrival_date > 29.50 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 121.78 | | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | | | |--- truncated branch of depth 26 | | | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | | | |--- truncated branch of depth 17 | | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | | |--- truncated branch of depth 26 | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | |--- weights: [19.65, 0.00] class: 0 | | | | |--- arrival_month > 11.50 | | | | | |--- total_stay <= 14.50 | | | | | | |--- weights: [59.40, 0.00] class: 0 | | | | | |--- total_stay > 14.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | |--- no_of_special_requests > 1.50 | | |--- no_of_week_nights <= 3.50 | | | |--- weights: [318.90, 0.00] class: 0 | | |--- no_of_week_nights > 3.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- lead_time <= 4.50 | | | | | |--- weights: [4.50, 0.00] class: 0 | | | | |--- lead_time > 4.50 | | | | | |--- arrival_date <= 3.50 | | | | | | |--- weights: [3.30, 0.00] class: 0 | | | | | |--- arrival_date > 3.50 | | | | | | |--- avg_price_per_room <= 93.24 | | | | | | | |--- room_type_reserved <= 0.50 | | | | | | | | |--- avg_price_per_room <= 70.49 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 70.49 | | | | | | | | | |--- arrival_date <= 
23.50 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 23.50 | | | | | | | | | | |--- avg_price_per_room <= 91.00 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 91.00 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- room_type_reserved > 0.50 | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 93.24 | | | | | | | |--- avg_price_per_room <= 107.29 | | | | | | | | |--- total_stay <= 6.50 | | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | | |--- weights: [5.55, 0.00] class: 0 | | | | | | | | |--- total_stay > 6.50 | | | | | | | | | |--- lead_time <= 20.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 20.00 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 107.29 | | | | | | | | |--- lead_time <= 82.50 | | | | | | | | | |--- lead_time <= 80.00 | | | | | | | | | | |--- lead_time <= 34.00 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- lead_time > 34.00 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- lead_time > 80.00 | | | | | | | | | | |--- avg_price_per_room <= 129.07 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- avg_price_per_room > 129.07 | | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- lead_time > 82.50 | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | |--- no_of_special_requests > 2.50 | | | | |--- market_segment_type <= 2.50 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- market_segment_type > 
2.50 | | | | | |--- weights: [10.35, 0.00] class: 0 |--- lead_time > 90.50 | |--- lead_time <= 151.50 | | |--- no_of_special_requests <= 0.50 | | | |--- arrival_month <= 1.50 | | | | |--- weights: [14.55, 0.00] class: 0 | | | |--- arrival_month > 1.50 | | | | |--- market_segment_type <= 3.50 | | | | | |--- arrival_month <= 11.50 | | | | | | |--- lead_time <= 116.50 | | | | | | | |--- total_stay <= 3.50 | | | | | | | | |--- lead_time <= 99.50 | | | | | | | | | |--- avg_price_per_room <= 90.47 | | | | | | | | | | |--- arrival_date <= 10.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 10.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | |--- avg_price_per_room > 90.47 | | | | | | | | | | |--- avg_price_per_room <= 94.75 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- avg_price_per_room > 94.75 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | |--- lead_time > 99.50 | | | | | | | | | |--- avg_price_per_room <= 58.75 | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 58.75 | | | | | | | | | | |--- arrival_month <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- arrival_month > 4.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | |--- total_stay > 3.50 | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | |--- total_stay <= 4.50 | | | | | | | | | | |--- avg_price_per_room <= 69.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- avg_price_per_room > 69.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- total_stay > 4.50 | | | | | | | | | | |--- arrival_date <= 5.50 | | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 5.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | |--- arrival_date > 27.50 | | | | | 
| | | | |--- total_stay <= 5.50 | | | | | | | | | | |--- avg_price_per_room <= 105.00 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- avg_price_per_room > 105.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- total_stay > 5.50 | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- lead_time > 116.50 | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | |--- room_cost_cat_Premium <= 0.50 | | | | | | | | | |--- weights: [13.05, 0.00] class: 0 | | | | | | | | |--- room_cost_cat_Premium > 0.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- avg_price_per_room <= 89.88 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- avg_price_per_room > 89.88 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | | |--- weights: [1.95, 0.85] class: 0 | | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | |--- room_cost_cat_Moderate <= 0.50 | | | | | | | | | | |--- total_stay <= 3.50 | | | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | | | |--- total_stay > 3.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | |--- room_cost_cat_Moderate > 0.50 | | | | | | | | | | |--- lead_time <= 146.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- lead_time > 146.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | |--- arrival_month > 11.50 | | | | | | |--- avg_price_per_room <= 177.83 | | | | | | | |--- weights: [15.60, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 177.83 | | | | | | | |--- weights: 
[0.00, 0.85] class: 1 | | | | |--- market_segment_type > 3.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 71.92 | | | | | | | |--- arrival_date <= 13.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | |--- avg_price_per_room <= 63.05 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 63.05 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | | |--- weights: [2.25, 0.00] class: 0 | | | | | | | |--- arrival_date > 13.50 | | | | | | | | |--- lead_time <= 142.00 | | | | | | | | | |--- avg_price_per_room <= 70.29 | | | | | | | | | | |--- type_of_meal_plan <= 2.00 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- type_of_meal_plan > 2.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- avg_price_per_room > 70.29 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | |--- lead_time > 142.00 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 71.92 | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | |--- avg_price_per_room <= 114.65 | | | | | | | | | |--- room_type_reserved <= 0.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 18 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- room_type_reserved > 0.50 | | | | | | | | | | |--- avg_price_per_room <= 89.02 | | | | | | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 89.02 
| | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | |--- avg_price_per_room > 114.65 | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | |--- total_stay <= 3.50 | | | | | | | | | | | |--- weights: [0.00, 22.10] class: 1 | | | | | | | | | | |--- total_stay > 3.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- arrival_month > 5.50 | | | | | | | | |--- lead_time <= 91.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- arrival_date <= 22.50 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 22.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- lead_time > 91.50 | | | | | | | | | |--- avg_price_per_room <= 99.38 | | | | | | | | | | |--- avg_price_per_room <= 97.95 | | | | | | | | | | | |--- truncated branch of depth 17 | | | | | | | | | | |--- avg_price_per_room > 97.95 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- avg_price_per_room > 99.38 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- no_of_weekend_nights <= 3.00 | | | | | | | |--- weights: [2.85, 0.00] class: 0 | | | | | | |--- no_of_weekend_nights > 3.00 | | | | | | | |--- 
weights: [0.00, 0.85] class: 1 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- arrival_month <= 8.50 | | | | | |--- arrival_year <= 2017.50 | | | | | | |--- type_of_meal_plan <= 0.50 | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | |--- avg_price_per_room <= 63.45 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 63.45 | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | |--- arrival_date <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- arrival_month > 7.50 | | | | | | | | |--- lead_time <= 110.50 | | | | | | | | | |--- total_stay <= 4.00 | | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | | | |--- total_stay > 4.00 | | | | | | | | | | |--- lead_time <= 99.50 | | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | | |--- lead_time > 99.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- lead_time > 110.50 | | | | | | | | | |--- total_stay <= 5.50 | | | | | | | | | | |--- lead_time <= 149.00 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- lead_time > 149.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- total_stay > 5.50 | | | | | | | | | | |--- arrival_date <= 11.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 11.00 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- type_of_meal_plan > 0.50 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | |--- arrival_year > 2017.50 | | | | | | |--- lead_time <= 142.50 | | | | | | | |--- avg_price_per_room <= 200.86 | | | | | | | | |--- arrival_month <= 
5.50 | | | | | | | | | |--- avg_price_per_room <= 71.96 | | | | | | | | | | |--- weights: [5.85, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 71.96 | | | | | | | | | | |--- avg_price_per_room <= 72.37 | | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | | | | |--- avg_price_per_room > 72.37 | | | | | | | | | | | |--- truncated branch of depth 20 | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | | |--- avg_price_per_room <= 92.08 | | | | | | | | | | | |--- weights: [7.65, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 92.08 | | | | | | | | | | | |--- truncated branch of depth 16 | | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | | |--- avg_price_per_room <= 130.95 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | | | |--- avg_price_per_room > 130.95 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | |--- avg_price_per_room > 200.86 | | | | | | | | |--- weights: [0.00, 5.95] class: 1 | | | | | | |--- lead_time > 142.50 | | | | | | | |--- avg_price_per_room <= 100.25 | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | |--- weights: [4.05, 0.00] class: 0 | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | |--- room_type_reserved <= 0.50 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- room_type_reserved > 0.50 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 100.25 | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | |--- lead_time <= 143.50 | | | 
| | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 143.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | |--- lead_time <= 150.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- lead_time > 150.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | |--- arrival_month > 8.50 | | | | | |--- avg_price_per_room <= 71.12 | | | | | | |--- total_stay <= 6.50 | | | | | | | |--- total_stay <= 1.50 | | | | | | | | |--- arrival_month <= 11.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- arrival_month > 11.00 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- total_stay > 1.50 | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | |--- weights: [1.95, 0.00] class: 0 | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | |--- weights: [2.40, 0.00] class: 0 | | | | | | |--- total_stay > 6.50 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | |--- avg_price_per_room > 71.12 | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | |--- total_stay <= 3.50 | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | |--- total_stay <= 2.50 | | | | | | | | | | |--- type_of_meal_plan <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- type_of_meal_plan > 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- total_stay > 2.50 | | | | | | | | | | |--- arrival_date <= 9.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 9.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | |--- room_type_reserved <= 3.50 | | | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | 
| | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | |--- room_type_reserved > 3.50 | | | | | | | | | | |--- lead_time <= 97.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 97.00 | | | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | |--- total_stay > 3.50 | | | | | | | | |--- lead_time <= 101.00 | | | | | | | | | |--- lead_time <= 92.50 | | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 92.50 | | | | | | | | | | |--- room_type_reserved <= 4.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- room_type_reserved > 4.00 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- lead_time > 101.00 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- arrival_date <= 5.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- arrival_date > 5.00 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | |--- avg_price_per_room <= 99.08 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 99.08 | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [13.50, 0.00] class: 0 | |--- lead_time > 151.50 | | |--- avg_price_per_room <= 100.04 | | | |--- no_of_special_requests <= 0.50 | | | | |--- no_of_adults <= 1.50 | 
| | | | |--- market_segment_type <= 3.50 | | | | | | |--- lead_time <= 163.50 | | | | | | | |--- total_stay <= 2.50 | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | |--- total_stay > 2.50 | | | | | | | | |--- total_stay <= 3.50 | | | | | | | | | |--- weights: [0.15, 0.85] class: 1 | | | | | | | | |--- total_stay > 3.50 | | | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | |--- lead_time > 163.50 | | | | | | | |--- lead_time <= 341.00 | | | | | | | | |--- lead_time <= 173.00 | | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | | |--- total_stay <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- total_stay > 1.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | | |--- arrival_month <= 5.00 | | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | | |--- arrival_month > 5.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- lead_time > 173.00 | | | | | | | | | |--- arrival_month <= 5.50 | | | | | | | | | | |--- arrival_date <= 7.50 | | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | | | | |--- arrival_date > 7.50 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | |--- arrival_month > 5.50 | | | | | | | | | | |--- avg_price_per_room <= 98.00 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- avg_price_per_room > 98.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- lead_time > 341.00 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- avg_price_per_room <= 88.00 | | | | | | | | | | |--- weights: [0.00, 8.50] class: 1 | | | | | | | | | |--- avg_price_per_room > 88.00 | | | | | | | | | | |--- arrival_date <= 8.50 | | | | | | | | | | | |--- weights: [0.15, 0.85] class: 1 | | | | | | | | | | |--- arrival_date > 8.50 | | | | | | | | | | | |--- truncated branch of depth 
2 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- room_cost_cat_Moderate <= 0.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | |--- room_cost_cat_Moderate > 0.50 | | | | | | | | | | |--- weights: [0.45, 1.70] class: 1 | | | | | |--- market_segment_type > 3.50 | | | | | | |--- avg_price_per_room <= 2.50 | | | | | | | |--- lead_time <= 285.50 | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | |--- lead_time > 285.50 | | | | | | | | |--- total_stay <= 4.50 | | | | | | | | | |--- lead_time <= 312.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 312.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- total_stay > 4.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 2.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- weights: [0.00, 49.30] class: 1 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- room_cost_cat_Moderate <= 0.50 | | | | | | | | | |--- weights: [0.00, 5.10] class: 1 | | | | | | | | |--- room_cost_cat_Moderate > 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- no_of_adults > 1.50 | | | | | |--- arrival_year <= 2017.50 | | | | | | |--- lead_time <= 215.50 | | | | | | | |--- lead_time <= 164.00 | | | | | | | | |--- avg_price_per_room <= 77.59 | | | | | | | | | |--- lead_time <= 161.00 | | | | | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | | | | | |--- lead_time > 161.00 | | | | | | | | | | |--- avg_price_per_room <= 52.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 52.50 | | | | | | | | | | | |--- weights: [1.05, 0.85] class: 0 | | | | | | | | |--- avg_price_per_room > 77.59 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- lead_time > 164.00 | | | | | | | | |--- arrival_date <= 9.50 | | | | | | | | | |--- avg_price_per_room <= 74.62 | 
| | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- avg_price_per_room > 74.62 | | | | | | | | | | |--- weights: [0.00, 11.90] class: 1 | | | | | | | | |--- arrival_date > 9.50 | | | | | | | | | |--- weights: [0.00, 44.20] class: 1 | | | | | | |--- lead_time > 215.50 | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | |--- weights: [10.80, 0.00] class: 0 | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | |--- weights: [0.00, 5.10] class: 1 | | | | | |--- arrival_year > 2017.50 | | | | | | |--- avg_price_per_room <= 82.47 | | | | | | | |--- lead_time <= 203.50 | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | |--- no_of_previous_cancellations <= 6.50 | | | | | | | | | | |--- lead_time <= 170.50 | | | | | | | | | | | |--- weights: [5.70, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 170.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- no_of_previous_cancellations > 6.50 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [0.00, 22.10] class: 1 | | | | | | | |--- lead_time > 203.50 | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | | |--- avg_price_per_room <= 60.68 | | | | | | | | | | | |--- weights: [2.55, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 60.68 | | | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | | |--- 
market_segment_type > 3.50 | | | | | | | | | | |--- no_of_week_nights <= 2.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 2.00 | | | | | | | | | | | |--- weights: [0.00, 7.65] class: 1 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- weights: [0.00, 71.40] class: 1 | | | | | | |--- avg_price_per_room > 82.47 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- market_segment_type <= 2.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- market_segment_type > 2.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- room_type_reserved <= 2.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | | |--- room_type_reserved > 2.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | |--- no_of_special_requests > 0.50 | | | | |--- no_of_weekend_nights <= 0.50 | | | | | |--- lead_time <= 180.50 | | | | | | |--- lead_time <= 159.50 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- arrival_month <= 6.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- arrival_month > 6.50 | 
| | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- avg_price_per_room <= 74.05 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 74.05 | | | | | | | | | |--- lead_time <= 158.50 | | | | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | | | | |--- lead_time > 158.50 | | | | | | | | | | |--- avg_price_per_room <= 89.62 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 89.62 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- lead_time > 159.50 | | | | | | | |--- arrival_date <= 1.50 | | | | | | | | |--- lead_time <= 176.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- lead_time > 176.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- arrival_date > 1.50 | | | | | | | | |--- no_of_adults <= 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- no_of_adults > 0.50 | | | | | | | | | |--- weights: [7.20, 0.00] class: 0 | | | | | |--- lead_time > 180.50 | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | |--- market_segment_type <= 3.50 | | | | | | | | |--- avg_price_per_room <= 86.22 | | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 86.22 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- room_type_reserved <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- room_type_reserved > 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- market_segment_type > 3.50 | | | | | | | | |--- avg_price_per_room <= 33.75 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 33.75 | | | | | | | | | |--- arrival_month <= 11.50 | | | | 
| | | | | | |--- weights: [0.00, 106.25] class: 1 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | |--- lead_time <= 195.00 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- lead_time > 195.00 | | | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | | |--- no_of_weekend_nights > 0.50 | | | | | |--- market_segment_type <= 3.50 | | | | | | |--- lead_time <= 348.50 | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | |--- weights: [19.50, 0.00] class: 0 | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | |--- total_stay <= 5.50 | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | |--- total_stay > 5.50 | | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | | |--- arrival_date <= 26.50 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | | |--- arrival_date > 26.50 | | | | | | | | | | | |--- weights: [0.30, 0.85] class: 1 | | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- lead_time > 348.50 | | | | | | | |--- avg_price_per_room <= 58.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 58.50 | | | | | | | | |--- lead_time <= 372.50 | | | | | | | | | |--- weights: [0.90, 1.70] class: 1 | | | | | | | | |--- lead_time > 372.50 | | | | | | | | | |--- weights: [0.15, 0.85] class: 1 | | | | | |--- market_segment_type > 3.50 | | | | | | |--- arrival_date <= 24.50 | | | | | | | |--- total_stay <= 9.50 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- avg_price_per_room <= 76.48 | | | | | | | | | | |--- lead_time <= 288.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- 
lead_time > 288.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- avg_price_per_room > 76.48 | | | | | | | | | | |--- lead_time <= 233.00 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | | |--- lead_time > 233.00 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- arrival_date <= 14.50 | | | | | | | | | | |--- lead_time <= 217.00 | | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 217.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 14.50 | | | | | | | | | | |--- avg_price_per_room <= 67.38 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 67.38 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | |--- total_stay > 9.50 | | | | | | | | |--- avg_price_per_room <= 66.06 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 66.06 | | | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | | | |--- lead_time <= 181.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 181.00 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | | | |--- avg_price_per_room <= 79.15 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 79.15 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- arrival_date > 24.50 | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | | |--- lead_time <= 165.00 | | | | | | | | | | |--- lead_time <= 157.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 157.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- 
lead_time > 165.00 | | | | | | | | | | |--- room_type_reserved <= 2.00 | | | | | | | | | | | |--- weights: [4.35, 0.00] class: 0 | | | | | | | | | | |--- room_type_reserved > 2.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | |--- lead_time <= 197.50 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- lead_time > 197.50 | | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | | |--- arrival_month > 7.50 | | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | | |--- avg_price_per_room <= 55.92 | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 55.92 | | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | |--- avg_price_per_room > 100.04 | | | |--- no_of_special_requests <= 2.50 | | | | |--- arrival_month <= 11.50 | | | | | |--- weights: [0.00, 1791.80] class: 1 | | | | |--- arrival_month > 11.50 | | | | | |--- no_of_special_requests <= 0.50 | | | | | | |--- weights: [7.05, 0.00] class: 0 | | | | | |--- no_of_special_requests > 0.50 | | | | | | |--- arrival_date <= 24.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | |--- arrival_date > 24.50 | | | | | | | |--- lead_time <= 153.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- lead_time > 153.50 | | | | | | | | |--- room_type_reserved <= 1.50 | | | | | | | | | |--- no_of_week_nights <= 3.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- no_of_week_nights > 3.00 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | |--- room_type_reserved > 1.50 | | | | | | | | | |--- lead_time <= 172.50 | | | 
| | | | | | | |--- avg_price_per_room <= 135.49 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 135.49 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- lead_time > 172.50 | | | | | | | | | | |--- weights: [0.00, 11.05] class: 1 | | | |--- no_of_special_requests > 2.50 | | | | |--- no_of_children <= 0.50 | | | | | |--- weights: [3.45, 0.00] class: 0 | | | | |--- no_of_children > 0.50 | | | | | |--- weights: [1.20, 0.00] class: 0
# Importance of each feature in the fitted tree. The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                           Imp
lead_time                             0.284550
no_of_special_requests                0.132515
avg_price_per_room                    0.127550
market_segment_type                   0.101483
arrival_month                         0.085268
arrival_date                          0.084116
total_stay                            0.033915
no_of_week_nights                     0.030165
no_of_weekend_nights                  0.029692
no_of_adults                          0.023457
arrival_year                          0.016306
type_of_meal_plan                     0.010091
room_type_reserved                    0.009914
required_car_parking_space            0.008527
no_of_children                        0.004601
room_cost_cat_Moderate                0.004385
repeated_guest                        0.004323
lead_time_bins_Moderate               0.003534
room_cost_cat_Premium                 0.002937
lead_time_bins_High                   0.000949
no_of_previous_cancellations          0.000845
lead_time_bins_Extreme                0.000602
no_of_previous_bookings_not_canceled  0.000274
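To make the "normalized" part of the comment above concrete, here is a minimal sketch on synthetic data (the names `X_demo`, `y_demo`, `demo_tree` and the `feature_i` columns are made up for illustration and are not part of the hotel dataset). It shows that the importances of a fitted tree add up to 1, so each value can be read as a share of the total impurity reduction.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the INN Hotels data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)
cols = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# Importances as a labelled Series, largest first
imp = pd.Series(demo_tree.feature_importances_, index=cols, name="Imp")
imp = imp.sort_values(ascending=False)

print(imp)
print("Sum of importances:", imp.sum())  # normalized, so this is 1 (up to rounding)
print(imp.cumsum())  # running share explained by the top features
```

The cumulative sum is one simple way to see how many features account for most of the splits, which is the same reading applied to the table above.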
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
    "max_depth": [5, 10, 15, None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=5,
min_impurity_decrease=0.01, random_state=1,
splitter='random')
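The grid search above can be inspected through `best_params_` and `best_score_`. The sketch below shows this on synthetic data with a smaller grid; `X_demo`, `y_demo` and `demo_grid` are hypothetical names, and the class balance is an assumption chosen for illustration. With the default `refit=True`, `GridSearchCV` already refits `best_estimator_` on the full training data, so a further call to `fit` is harmless but not required.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the hotel booking dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.67, 0.33], random_state=1
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85}),
    {"max_depth": [3, 5, None], "criterion": ["entropy", "gini"]},
    scoring=make_scorer(recall_score),
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated recall
print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))

# best_estimator_ is already fitted because refit=True by default
refit_is_fitted = hasattr(demo_grid.best_estimator_, "tree_")
print(refit_is_fitted)
```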
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = get_recall_score(estimator, X_train, y_train)
print("Recall Score:", decision_tree_tune_perf_train)
Recall Score: 0.9504962334090638
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = get_recall_score(estimator, X_test, y_test)
print("Recall Score:", decision_tree_tune_perf_test)
Recall Score: 0.9452015900056786
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time_bins_Extreme <= 0.12
|   |--- no_of_special_requests <= 1.54
|   |   |--- no_of_special_requests <= 0.35
|   |   |   |--- arrival_year <= 2017.81
|   |   |   |   |--- weights: [306.45, 263.50] class: 0
|   |   |   |--- arrival_year > 2017.81
|   |   |   |   |--- market_segment_type <= 3.63
|   |   |   |   |   |--- weights: [413.40, 460.70] class: 1
|   |   |   |   |--- market_segment_type > 3.63
|   |   |   |   |   |--- weights: [282.75, 2023.85] class: 1
|   |   |--- no_of_special_requests > 0.35
|   |   |   |--- weights: [797.40, 812.60] class: 1
|   |--- no_of_special_requests > 1.54
|   |   |--- weights: [413.55, 88.40] class: 0
|--- lead_time_bins_Extreme > 0.12
|   |--- weights: [340.80, 3459.50] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
# After tuning, the importance is concentrated in the four features the tree splits on; all other features have an importance of 0
                                           Imp
lead_time_bins_Extreme                0.393271
no_of_special_requests                0.345502
market_segment_type                   0.160128
arrival_year                          0.101099
no_of_week_nights                     0.000000
no_of_previous_bookings_not_canceled  0.000000
lead_time_bins_High                   0.000000
lead_time_bins_Moderate               0.000000
room_cost_cat_Premium                 0.000000
room_cost_cat_Moderate                0.000000
total_stay                            0.000000
no_of_weekend_nights                  0.000000
avg_price_per_room                    0.000000
no_of_previous_cancellations          0.000000
type_of_meal_plan                     0.000000
repeated_guest                        0.000000
no_of_children                        0.000000
arrival_date                          0.000000
arrival_month                         0.000000
lead_time                             0.000000
room_type_reserved                    0.000000
required_car_parking_space            0.000000
no_of_adults                          0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | 0.006774 |
| 1 | 1.034059e-20 | 0.006774 |
| 2 | 1.034059e-20 | 0.006774 |
| 3 | 1.034059e-20 | 0.006774 |
| 4 | 1.034059e-20 | 0.006774 |
| ... | ... | ... |
| 1885 | 5.456394e-03 | 0.272057 |
| 1886 | 6.138880e-03 | 0.278196 |
| 1887 | 1.459269e-02 | 0.292789 |
| 1888 | 2.518565e-02 | 0.343160 |
| 1889 | 4.577428e-02 | 0.388934 |
1890 rows × 2 columns
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
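The pruning path lists the effective alphas at which successive subtrees are pruned away; larger alphas leave fewer nodes and a higher total leaf impurity. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo`, `demo_path`, `demo_alphas` and `node_counts` are hypothetical names, and clipping the alphas at zero is a precaution I am assuming here against tiny negative values from floating-point error, not a step taken in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the hotel booking dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

demo_path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_demo, y_demo
)
# Guard against tiny negative alphas caused by floating-point error
demo_alphas = np.clip(demo_path.ccp_alphas, 0, None)

# Fit one tree at the smallest alpha and one at the largest alpha in the path
node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    .fit(X_demo, y_demo)
    .tree_.node_count
    for alpha in (demo_alphas[0], demo_alphas[-1])
]
print("alphas in path:", len(demo_alphas))
print("nodes at smallest / largest alpha:", node_counts)
```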
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04577428273917339
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# select the model with the highest test recall (np.argmax returns the first maximum)
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.025185648755395765,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.025185648755395765,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(best_model, X_train, y_train)
print("Recall Score:", get_recall_score(best_model, X_train, y_train))
Recall Score: 1.0
confusion_matrix_sklearn(best_model, X_test, y_test)
print("Recall Score:", get_recall_score(best_model, X_test, y_test))
Recall Score: 1.0
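A recall of 1.0 on both sets deserves a sanity check, because recall only counts missed cancellations and says nothing about false alarms. As a hedged illustration, the sketch below shows that a classifier which labels every booking as canceled also reaches a recall of 1.0, while its precision equals the share of canceled bookings. The data are synthetic; the 33% positive rate and the names `X_demo`, `y_demo` and `always_cancel` are assumptions made for this example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic labels with roughly one third positives (assumed rate, for illustration)
rng = np.random.default_rng(1)
y_demo = (rng.random(1000) < 0.33).astype(int)
X_demo = np.zeros((1000, 1))  # features are irrelevant for a constant classifier

# A classifier that predicts class 1 (canceled) for every booking
always_cancel = DummyClassifier(strategy="constant", constant=1).fit(X_demo, y_demo)
pred = always_cancel.predict(X_demo)

demo_recall = recall_score(y_demo, pred)
demo_precision = precision_score(y_demo, pred)
print("Recall:", demo_recall)
print("Precision:", round(demo_precision, 3))
```

Checking the confusion matrix, precision or F1 alongside recall helps tell a useful model apart from one that flags every booking.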
plt.figure(figsize=(5, 5))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
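Creating a model with ccp_alpha = 0.002. The model selected above sits at the second-largest alpha in the pruning path, where the tree is pruned almost to the root, so a smaller alpha keeps more of the tree.
best_model2 = DecisionTreeClassifier(
    ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85}, random_state=1
)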
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85},
random_state=1)
confusion_matrix_sklearn(best_model2, X_train, y_train)
decision_tree_postpruned_perf_train = get_recall_score(best_model2, X_train, y_train)
print("Recall Score:", decision_tree_postpruned_perf_train)
Recall Score: 0.943321774482841
confusion_matrix_sklearn(best_model2, X_test, y_test)
decision_tree_postpruned_perf_test = get_recall_score(best_model2, X_test, y_test)
print("Recall Score:", decision_tree_postpruned_perf_test)
Recall Score: 0.9375354911981828
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
    best_model2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))
|--- lead_time <= 90.50
|   |--- no_of_special_requests <= 1.50
|   |   |--- market_segment_type <= 3.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- market_segment_type <= 2.50
|   |   |   |   |   |--- weights: [147.75, 79.90] class: 0
|   |   |   |   |--- market_segment_type > 2.50
|   |   |   |   |   |--- avg_price_per_room <= 199.01
|   |   |   |   |   |   |--- weights: [268.50, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room > 199.01
|   |   |   |   |   |   |--- weights: [0.15, 13.60] class: 1
|   |   |   |--- no_of_weekend_nights > 0.50
|   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |--- weights: [252.30, 128.35] class: 0
|   |   |   |   |--- lead_time > 68.50
|   |   |   |   |   |--- weights: [36.60, 90.10] class: 1
|   |   |--- market_segment_type > 3.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |--- weights: [92.25, 80.75] class: 0
|   |   |   |   |--- lead_time > 3.50
|   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |--- weights: [48.30, 40.80] class: 0
|   |   |   |   |   |--- arrival_year > 2017.50
|   |   |   |   |   |   |--- weights: [182.25, 1573.35] class: 1
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |   |--- weights: [116.55, 39.95] class: 0
|   |   |   |   |   |--- lead_time > 6.50
|   |   |   |   |   |   |--- weights: [390.45, 612.00] class: 1
|   |   |   |   |--- arrival_month > 11.50
|   |   |   |   |   |--- weights: [59.40, 0.85] class: 0
|   |--- no_of_special_requests > 1.50
|   |   |--- no_of_week_nights <= 3.50
|   |   |   |--- weights: [318.90, 0.00] class: 0
|   |   |--- no_of_week_nights > 3.50
|   |   |   |--- weights: [46.80, 32.30] class: 0
|--- lead_time > 90.50
|   |--- lead_time <= 151.50
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- arrival_month <= 1.50
|   |   |   |   |--- weights: [14.55, 0.00] class: 0
|   |   |   |--- arrival_month > 1.50
|   |   |   |   |--- weights: [161.55, 1018.30] class: 1
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [212.85, 326.40] class: 1
|   |--- lead_time > 151.50
|   |   |--- avg_price_per_room <= 100.04
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |--- weights: [54.60, 104.55] class: 1
|   |   |   |   |--- no_of_adults > 1.50
|   |   |   |   |   |--- weights: [49.50, 951.15] class: 1
|   |   |   |--- no_of_special_requests > 0.50
|   |   |   |   |--- weights: [87.90, 211.65] class: 1
|   |   |--- avg_price_per_room > 100.04
|   |   |   |--- weights: [13.20, 1804.55] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
    pd.DataFrame(
        best_model2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                                           Imp
lead_time                             0.397849
no_of_special_requests                0.272711
market_segment_type                   0.190418
arrival_month                         0.036133
avg_price_per_room                    0.032733
arrival_year                          0.021070
no_of_weekend_nights                  0.020103
no_of_adults                          0.015312
no_of_week_nights                     0.013671
room_type_reserved                    0.000000
required_car_parking_space            0.000000
arrival_date                          0.000000
no_of_children                        0.000000
repeated_guest                        0.000000
no_of_previous_cancellations          0.000000
no_of_previous_bookings_not_canceled  0.000000
type_of_meal_plan                     0.000000
total_stay                            0.000000
room_cost_cat_Moderate                0.000000
room_cost_cat_Premium                 0.000000
lead_time_bins_Moderate               0.000000
lead_time_bins_High                   0.000000
lead_time_bins_Extreme                0.000000
importances = best_model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.DataFrame(
    [
        decision_tree_perf_train,
        decision_tree_tune_perf_train,
        decision_tree_postpruned_perf_train,
    ],
    columns=["Recall on training set"],
)
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Recall on training set |
|---|---|
| 0 | 0.998326 |
| 1 | 0.950496 |
| 2 | 0.943322 |
# testing performance comparison
models_test_comp_df = pd.DataFrame(
    [
        decision_tree_perf_test,
        decision_tree_tune_perf_test,
        decision_tree_postpruned_perf_test,
    ],
    columns=["Recall on testing set"],
)
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Recall on testing set |
|---|---|
| 0 | 0.798410 |
| 1 | 0.945202 |
| 2 | 0.937535 |
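The two comparison tables index the models as 0, 1 and 2, in the order the scores were passed in. A minimal sketch of a labeled, side-by-side view follows, using the recall values printed above. The row labels and the gap column are assumptions added for illustration; `comparison` is a hypothetical name.

```python
import pandas as pd

# Recall values copied from the two comparison tables above
comparison = pd.DataFrame(
    {
        "Recall on training set": [0.998326, 0.950496, 0.943322],
        "Recall on testing set": [0.798410, 0.945202, 0.937535],
    },
    # Row labels are assumed from the order of the score variables
    index=[
        "Initial decision tree",
        "Tuned (GridSearchCV)",
        "Post-pruned (ccp_alpha=0.002)",
    ],
)

# A large train-test gap indicates overfitting
comparison["Train - test gap"] = (
    comparison["Recall on training set"] - comparison["Recall on testing set"]
)
print(comparison.round(3))
```

On these numbers the initial tree shows a gap of about 0.20 between training and testing recall, while the tuned and post-pruned trees stay within about 0.01.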